Group 7: Rachel Butterfield, Justin Hamilton, Heber Jenson
Maverik will be able to implement the model to predict daily sales of diesel fuel, unleaded fuel, merchandise, and food service for the entire first year of a store's operation. To be successful, the model must surpass the accuracy metrics of the current Maverik model in predicting sales for these four product categories.
Maverik, a convenience store and gas station chain with 300+ locations in the Western United States, aims to enhance its ability to assess the return on investment (ROI) for new stores, which are opened at an average rate of 30 per year. To achieve this, they require a new model capable of generating daily sales projections for merchandise, food sales, and total gallons sold of diesel and unleaded fuel for the entire first year of a store's operation. This data-driven approach will enable Maverik to closely monitor the performance of new stores, make informed decisions, and promptly address any underperforming outlets. Maverik wants daily sales figures for each of the four main products: diesel fuel, unleaded fuel, merchandise, and food service. Daily predictions allow for better planning around seasonality and for setting realistic goals and expectations for the revenue each new store can bring in for the year.
The target variables are the daily sales of diesel fuel, unleaded fuel, merchandise, and food service. They are represented in the time_series_data_msba.csv file, a set of time series sales data recorded daily for each store.
The main accuracy metrics we will use to evaluate the performance of this model are RMSE and R-squared.
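As a quick, self-contained illustration of these two metrics (the sales figures below are made up for the example), RMSE and R-squared can be computed with plain NumPy:

```python
import numpy as np

# Hypothetical actual vs. predicted daily sales (illustrative values only)
y_true = np.array([2168.3, 2051.6, 2257.5, 1520.6])
y_pred = np.array([2100.0, 2000.0, 2300.0, 1600.0])

# RMSE: square root of the mean squared prediction error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:.2f}, R-squared: {r2:.3f}")
```

Lower RMSE and higher R-squared both indicate a better fit; we will compute the same quantities on the model's held-out predictions.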
Predict the daily sales of each of the four products (diesel, unleaded fuel, merchandise, and food service) for each store for the following year based on the time series data.
Use the Prophet method to generate daily sales predictions for each of the four product categories for a new store.
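A minimal sketch of how such a Prophet forecast could be set up, assuming the `prophet` package is available; the toy series below stands in for one store's daily sales of a single product (Prophet expects a two-column frame with `ds` dates and `y` values):

```python
import numpy as np
import pandas as pd

# Toy daily sales series with a weekly cycle (illustrative values only)
df = pd.DataFrame({
    "ds": pd.date_range("2022-06-17", periods=90, freq="D"),
    "y": 1500 + 100 * np.sin(2 * np.pi * np.arange(90) / 7),
})

try:
    from prophet import Prophet
    m = Prophet()                                   # weekly/yearly seasonality detected automatically
    m.fit(df)                                       # fit on the store's observed history
    future = m.make_future_dataframe(periods=365)   # extend one year ahead
    forecast = m.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]
except ImportError:
    forecast = None  # prophet not installed in this environment
```

In the actual model, one such series would be fit per target variable (diesel, unleaded, inside sales, food service).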
The success of this project will be measured by its capacity to accurately predict sales metrics and its usability for future stores. Our team will judge our model based on R-squared and RMSE results. Since our target variables are numeric, we will explore which data features are significant via a linear regression model and then apply those features to a variety of modeling techniques, including a time series model. In this notebook we will be exploring and cleaning the data. Some of the questions we would like to answer are the following:
This notebook walks through the initial process of exploring the data Maverik provided and preparing it for future modeling. The notebook explores missing data and variable distributions and explains why variables have been removed or changed. In addition to the data provided by Maverik, we added external data that may be useful for estimation: historical prices for oil and gallons of fuel, as well as temperature data for states where Maverik currently operates. The notebook concludes with a final data set that can be used for future model development.
The primary objectives of this EDA notebook are as follows:
The data that was provided to us comprised two different data sets. The first data set contained time series data for 37 different Maverik locations. This data showed historical daily information for each store, including inside sales, food service sales, total gallons of diesel sold, and total gallons of regular sold. For our model development, these are the target variables our model will need to predict. The other variables in the time series data identify the store and the date the information was recorded. The second data set Maverik provided contained a total of 54 variables, comprised of both quantitative and qualitative variables about each store. Variables include categorical features, such as whether the store offers pizza or lottery tickets, as well as quantitative features, such as the store's square footage.
| Column | Description |
|---|---|
| capital_projects.soft_opening_date | The date this store opened |
| calendar.calendar_day_date | Date |
| calendar.fiscal_week_id_for_year | Fiscal Week Number |
| calendar.day_of_week | Day of the Week |
| calendar_information.holiday | Holiday Information for the date |
| calendar_information.type_of_day | Day Type |
| daily_yoy_ndt.total_inside_sales | Inside sales, everything that isn't made at the store |
| daily_yoy_ndt.total_food_service | Food service sales, everything that is made at the store |
| diesel | Diesel gallons sold |
| unleaded | All non-diesel gallons sold |
| site_id_msba | Unique Site Keys |
This section focuses on the qualitative dataset. It includes categorical data on the products sold and services offered at each Maverik store, as well as continuous data describing interior and exterior features of the store. This dataset will be used to determine which products are sold within the four product categories, and to track product offerings for comparison against new stores when the model is implemented. The dataset lists information for 37 stores in total.
This dataset comes with the assumption that at all Maverik locations there are unleaded gas tanks available.
| Column | Description |
|---|---|
| open_year | Year the site opened |
| square_feet | Square footage of store |
| front_door_count | Number of entrances |
| years_since_last_project | Years since last project |
| parking_spaces | Number of parking spaces |
| lottery | Lottery tickets offered? |
| freal | Freals offered? |
| bonfire_grill | Bonfire Grill included? |
| pizza | Pizza offered? |
| cinnabon | Cinnabon offered? |
| godfather_s_pizza | Godfather's Pizza offered? |
| ethanol_free | Ethanol free fuel offered? |
| diesel | Diesel fuel available? |
| hi_flow_lanes | High flow truck lanes available? |
| rv_lanes | RV lanes available? |
| hi_flow_rv_lanes | High flow RV lanes available? |
| def | Diesel exhaust fluid dispensed? |
| cat_scales | Truck weigh (CAT) scales on site? |
| car_wash | Car wash on site? |
| ev_charging | Electric vehicle charging offered? |
| rv_dumps | RV dumps available? |
| propane | Propane offered? |
| x1_mile_pop | Population within 1-mile radius |
| x1_mile_emp | Number of individuals in workplace within 1-mile radius |
| x1_mile_income | Median income of 1-mile radius population |
| x1_2_mile_pop | Population within 1/2-mile radius |
| x1_2_mile_emp | Number of individuals in workplace within 1/2-mile radius |
| x1_2_mile_income | Median income of 1/2-mile radius population |
| x5_min_pop | Population within 5-minute radius |
| x5_min_emp | Number of individuals in workplace within 5-minute radius |
| x5_min_inc | Median income of 5-minute radius population |
| x7_min_pop | Population within 7-minute radius |
| x7_min_emp | Number of individuals in workplace within 7-minute radius |
| x7_min_inc | Median income of 7-minute radius population |
| traditional_forecourt_fueling_positions | Number of traditional forecourt fueling positions |
| traditional_forecourt_layout | Traditional position layout type (stacked, inline, etc.) |
| traditional_forecourt_stack_type | Traditional position stack type (if applicable) |
| rv_lanes_fueling_positions | Number of RV fueling positions |
| rv_lanes_layout | RV lane layout type (stacked, inline, etc.) |
| rv_lanes_stack_type | RV lane stack type (if applicable) |
| hi_flow_lanes_fueling_positions | Number of high flow fueling positions |
| hi_flow_lanes_layout | High flow lane layout type (stacked, inline, etc.) |
| hi_flow_lanes_stack_type | High flow lane stack type |
| hi_flow_lanes_fueling_positions_2 | NA |
| rv_lanes_fueling_positions_2 | NA |
| hi_flow_rv_lanes_layout | High flow RV lane layout type (stacked, inline, etc.) |
| hi_flow_rv_lanes_stack_type | High flow RV lane stack type (if applicable) |
| non_24_hour | Not open 24/7 |
| self_check_out | Self checkout available? |
| mens_toilet_count | Number of men's stalls |
| mens_urinal_count | Number of men's urinals |
| womens_toilet_count | Number of women's stalls |
| womens_sink_count | Number of women's sinks |
| site_id_msba | Unique site key |
q_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37 entries, 0 to 36
Data columns (total 55 columns):
 #   Column                                   Non-Null Count  Dtype
---  ------                                   --------------  -----
 0   Unnamed: 0                               37 non-null     int64
 1   open_year                                37 non-null     int64
 2   square_feet                              37 non-null     int64
 3   front_door_count                         37 non-null     int64
 4   years_since_last_project                 37 non-null     int64
 5   parking_spaces                           37 non-null     int64
 6   lottery                                  37 non-null     object
 7   freal                                    37 non-null     object
 8   bonfire_grill                            37 non-null     object
 9   pizza                                    37 non-null     object
 10  cinnabon                                 37 non-null     object
 11  godfather_s_pizza                        37 non-null     object
 12  ethanol_free                             37 non-null     object
 13  diesel                                   37 non-null     object
 14  hi_flow_lanes                            37 non-null     object
 15  rv_lanes                                 37 non-null     object
 16  hi_flow_rv_lanes                         37 non-null     object
 17  def                                      37 non-null     object
 18  cat_scales                               37 non-null     object
 19  car_wash                                 37 non-null     object
 20  ev_charging                              37 non-null     object
 21  rv_dumps                                 37 non-null     object
 22  propane                                  37 non-null     object
 23  x1_mile_pop                              37 non-null     int64
 24  x1_mile_emp                              37 non-null     int64
 25  x1_mile_income                           37 non-null     int64
 26  x1_2_mile_pop                            37 non-null     int64
 27  x1_2_mile_emp                            37 non-null     int64
 28  x1_2_mile_income                         37 non-null     int64
 29  x5_min_pop                               37 non-null     int64
 30  x5_min_emp                               37 non-null     int64
 31  x5_min_inc                               37 non-null     int64
 32  x7_min_pop                               37 non-null     int64
 33  x7_min_emp                               37 non-null     int64
 34  x7_min_inc                               37 non-null     int64
 35  traditional_forecourt_fueling_positions  37 non-null     int64
 36  traditional_forecourt_layout             37 non-null     object
 37  traditional_forecourt_stack_type         37 non-null     object
 38  rv_lanes_fueling_positions               37 non-null     int64
 39  rv_lanes_layout                          23 non-null     object
 40  rv_lanes_stack_type                      23 non-null     object
 41  hi_flow_lanes_fueling_positions          37 non-null     int64
 42  hi_flow_lanes_layout                     22 non-null     object
 43  hi_flow_lanes_stack_type                 22 non-null     object
 44  hi_flow_lanes_fueling_positions_2        37 non-null     int64
 45  rv_lanes_fueling_positions_2             37 non-null     int64
 46  hi_flow_rv_lanes_layout                  23 non-null     object
 47  hi_flow_rv_lanes_stack_type              23 non-null     object
 48  non_24_hour                              37 non-null     object
 49  self_check_out                           37 non-null     object
 50  mens_toilet_count                        37 non-null     int64
 51  mens_urinal_count                        37 non-null     int64
 52  womens_toilet_count                      37 non-null     int64
 53  womens_sink_count                        37 non-null     int64
 54  site_id_msba                             37 non-null     int64
dtypes: int64(28), object(27)
memory usage: 16.0+ KB
q_data.head()
| Unnamed: 0 | open_year | square_feet | front_door_count | years_since_last_project | parking_spaces | lottery | freal | bonfire_grill | pizza | ... | rv_lanes_fueling_positions_2 | hi_flow_rv_lanes_layout | hi_flow_rv_lanes_stack_type | non_24_hour | self_check_out | mens_toilet_count | mens_urinal_count | womens_toilet_count | womens_sink_count | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2021 | 5046 | 2 | 2 | 38 | Yes | Yes | Yes | No | ... | 6 | Stack | HF/RV | No | Yes | 2 | 2 | 6 | 2 | 21560 |
| 1 | 2 | 2021 | 5046 | 2 | 2 | 39 | No | Yes | Yes | Yes | ... | 4 | Combo | HF/RV | No | Yes | 5 | 5 | 10 | 4 | 21980 |
| 2 | 3 | 2021 | 5046 | 2 | 2 | 35 | Yes | Yes | Yes | Yes | ... | 5 | In-Line | None | No | Yes | 3 | 2 | 4 | 1 | 22015 |
| 3 | 4 | 2021 | 5046 | 2 | 2 | 36 | No | Yes | Yes | Yes | ... | 4 | Combo | HF/RV | No | Yes | 3 | 3 | 6 | 2 | 22085 |
| 4 | 5 | 2021 | 5046 | 2 | 2 | 25 | Yes | Yes | Yes | No | ... | 0 | NaN | NaN | No | Yes | 0 | 0 | 0 | 0 | 22120 |
5 rows × 55 columns
#qual DF summary stats
q_data.describe()
| Unnamed: 0 | open_year | square_feet | front_door_count | years_since_last_project | parking_spaces | x1_mile_pop | x1_mile_emp | x1_mile_income | x1_2_mile_pop | ... | traditional_forecourt_fueling_positions | rv_lanes_fueling_positions | hi_flow_lanes_fueling_positions | hi_flow_lanes_fueling_positions_2 | rv_lanes_fueling_positions_2 | mens_toilet_count | mens_urinal_count | womens_toilet_count | womens_sink_count | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 37.000000 | 37.000000 | 37.00000 | 37.0 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | ... | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 |
| mean | 19.000000 | 2021.324324 | 4970.27027 | 2.0 | 1.648649 | 37.405405 | 6703.567568 | 4757.648649 | 53300.378378 | 1833.108108 | ... | 14.270270 | 2.513514 | 3.324324 | 3.324324 | 2.513514 | 2.378378 | 2.351351 | 4.648649 | 1.702703 | 23040.405405 |
| std | 10.824355 | 0.474579 | 575.93121 | 0.0 | 0.483978 | 5.918237 | 5694.011350 | 4697.168291 | 24333.027254 | 1915.140476 | ... | 3.948619 | 2.049683 | 2.925501 | 2.925501 | 2.049683 | 0.923500 | 0.856875 | 1.751447 | 0.740303 | 730.069801 |
| min | 1.000000 | 2021.000000 | 2933.00000 | 2.0 | 1.000000 | 23.000000 | 0.000000 | 56.000000 | 0.000000 | 0.000000 | ... | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 21560.000000 |
| 25% | 10.000000 | 2021.000000 | 5046.00000 | 2.0 | 1.000000 | 34.000000 | 1984.000000 | 1771.000000 | 39538.000000 | 262.000000 | ... | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 4.000000 | 1.000000 | 22540.000000 |
| 50% | 19.000000 | 2021.000000 | 5046.00000 | 2.0 | 2.000000 | 38.000000 | 5574.000000 | 3895.000000 | 46356.000000 | 1003.000000 | ... | 12.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 | 2.000000 | 2.000000 | 4.000000 | 2.000000 | 22890.000000 |
| 75% | 28.000000 | 2022.000000 | 5046.00000 | 2.0 | 2.000000 | 41.000000 | 11269.000000 | 6002.000000 | 73519.000000 | 2686.000000 | ... | 16.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 | 3.000000 | 3.000000 | 6.000000 | 2.000000 | 23555.000000 |
| max | 37.000000 | 2022.000000 | 6134.00000 | 2.0 | 2.000000 | 49.000000 | 18692.000000 | 26077.000000 | 110957.000000 | 5923.000000 | ... | 24.000000 | 6.000000 | 9.000000 | 9.000000 | 6.000000 | 5.000000 | 5.000000 | 10.000000 | 4.000000 | 24535.000000 |
8 rows × 28 columns
The initial review of the summary statistics for the qualitative DF shows that some stores are located in larger-population and higher-income areas. During the modeling process, we'll evaluate further how those values impact the model and whether they need to be removed as outliers.
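One simple rule we could apply later to flag such stores is the 1.5 × IQR fence; a sketch using the x1_mile_pop five-number summary from the describe() table above:

```python
import pandas as pd

# min / 25% / 50% / 75% / max of x1_mile_pop from the summary statistics above
pops = pd.Series([0, 1984, 5574, 11269, 18692])

q1, q3 = pops.quantile([0.25, 0.75])
iqr = q3 - q1
fence_low, fence_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences would be flagged as potential outliers
outliers = pops[(pops < fence_low) | (pops > fence_high)]
```

By this rule none of the five summary points fall outside the fences; during modeling we would apply the same check to the full per-store distribution rather than just the quartiles.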
time_data.head()
| Unnamed: 0 | capital_projects.soft_opening_date | calendar.calendar_day_date | calendar.fiscal_week_id_for_year | calendar.day_of_week | calendar_information.holiday | calendar_information.type_of_day | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel | unleaded | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6/14/2022 | 6/17/2022 | 25 | Friday | NONE | WEEKDAY | 2168.2920 | 861.6930 | 722.7745 | 1425.1020 | 24535 |
| 1 | 2 | 6/14/2022 | 6/22/2022 | 25 | Wednesday | NONE | WEEKDAY | 2051.5635 | 808.0275 | 730.4850 | 1436.2740 | 24535 |
| 2 | 3 | 6/14/2022 | 6/23/2022 | 25 | Thursday | NONE | WEEKDAY | 2257.5000 | 966.4410 | 895.7970 | 1594.3725 | 24535 |
| 3 | 4 | 6/14/2022 | 6/26/2022 | 26 | Sunday | NONE | WEEKEND | 1520.5925 | 542.3250 | 584.2900 | 1124.9280 | 24535 |
| 4 | 5 | 6/14/2022 | 6/27/2022 | 26 | Monday | NONE | WEEKDAY | 1897.6930 | 771.4525 | 852.2605 | 1640.2540 | 24535 |
We can drop the Unnamed column from the data. This variable adds no value to our analysis and has no meaning.
time_data = time_data.drop('Unnamed: 0', axis= 1)
time_data.head()
| capital_projects.soft_opening_date | calendar.calendar_day_date | calendar.fiscal_week_id_for_year | calendar.day_of_week | calendar_information.holiday | calendar_information.type_of_day | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel | unleaded | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6/14/2022 | 6/17/2022 | 25 | Friday | NONE | WEEKDAY | 2168.2920 | 861.6930 | 722.7745 | 1425.1020 | 24535 |
| 1 | 6/14/2022 | 6/22/2022 | 25 | Wednesday | NONE | WEEKDAY | 2051.5635 | 808.0275 | 730.4850 | 1436.2740 | 24535 |
| 2 | 6/14/2022 | 6/23/2022 | 25 | Thursday | NONE | WEEKDAY | 2257.5000 | 966.4410 | 895.7970 | 1594.3725 | 24535 |
| 3 | 6/14/2022 | 6/26/2022 | 26 | Sunday | NONE | WEEKEND | 1520.5925 | 542.3250 | 584.2900 | 1124.9280 | 24535 |
| 4 | 6/14/2022 | 6/27/2022 | 26 | Monday | NONE | WEEKDAY | 1897.6930 | 771.4525 | 852.2605 | 1640.2540 | 24535 |
Both data sets come with minimal missing data. While the qualitative data set contains the values None and NA, these are not considered missing data. As Maverik stated, NA has a specific meaning: the feature is not applicable at that location, i.e. the store simply does not have it. None indicates that the store has explicitly not been given the feature for a justifiable reason. Neither NA nor None signals that the data is incomplete.
For example, the variables non_24_hour and car_wash each contain only a single value. This is valuable information: no Maverik location has a car wash, and all Maverik locations are open 24 hours.
We wanted to determine if there were any missing values in the provided data. On initial exploration we found six columns that contained missing values. Upon further analysis, it was determined that these reported missing values actually indicate that the store does not have a specific feature. For example, many of the stores did not have RV lanes as part of the store layout; the data reported this as an NA value. Once the values in the six columns were changed to an appropriate string, the data did not contain any other missing values.
Later in the report, we will add additional Macro variables to the data. By adding these features, we will introduce missing values. To address this issue, we used "last observation carried forward" and "next observation carried back" to impute missing data. This ensured the final data set contained no missing values.
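A minimal sketch of that imputation, using a hypothetical daily price series with gaps (pandas `ffill` implements last observation carried forward, `bfill` next observation carried back):

```python
import numpy as np
import pandas as pd

# Hypothetical daily oil-price series with gaps, including a leading NaN
prices = pd.Series(
    [np.nan, 78.1, np.nan, np.nan, 80.4, np.nan],
    index=pd.date_range("2022-01-01", periods=6, freq="D"),
)

# ffill (LOCF) handles interior and trailing gaps;
# bfill (NOCB) then handles any leading gap ffill cannot reach
filled = prices.ffill().bfill()
assert filled.isna().sum() == 0
```

Applying ffill first and bfill second means observed values are carried forward wherever possible, and back-filling is only used for gaps at the very start of a series.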
def percentage_na_values_table(df):
# Sum of NA values in df
na_val = df.isna().sum()
# Percentage of NA values
na_val_perc = 100 * na_val / len(df)
# Create Table
na_col_table = pd.concat([na_val, na_val_perc], axis = 1)
# Sort by %NA descending
na_col_table = na_col_table[
na_col_table.iloc[:,1] != 0].sort_values(1, ascending = False).round(1)
# Add column names
na_col_table = na_col_table.rename(columns = {0: 'Total NA\'s in Column', 1: "Percentage NA"})
print("DF has " + str(df.shape[1]) + " columns.\nThere are " + str(na_col_table.shape[0]) + " columns that have missing values.")
# Return Table
return na_col_table
The results of the analysis show that there are 6 columns with missing values. On further inspection, Maverik confirmed that these values are not actually missing data. Instead, a value of NA states that the particular store does not have that feature. This means we need to change the values from NA to a different string.
percentage_na_values_table(data).tail(50)
DF has 64 columns. There are 6 columns that have missing values.
| Total NA's in Column | Percentage NA | |
|---|---|---|
| hi_flow_lanes_layout | 5490 | 40.5 |
| hi_flow_lanes_stack_type | 5490 | 40.5 |
| rv_lanes_layout | 5124 | 37.8 |
| rv_lanes_stack_type | 5124 | 37.8 |
| hi_flow_rv_lanes_layout | 5124 | 37.8 |
| hi_flow_rv_lanes_stack_type | 5124 | 37.8 |
data.fillna('None', inplace = True)
percentage_na_values_table(data)
DF has 64 columns. There are 0 columns that have missing values.
| Total NA's in Column | Percentage NA |
|---|
Once we have changed the values, the data from Maverik no longer has any missing values.
time_data.site_id_msba.unique()
array([24535, 24255, 24220, 24150, 23905, 23835, 23765, 23730, 23660,
23555, 23485, 23450, 23415, 23380, 23345, 23240, 23135, 23065,
22925, 22890, 22855, 22820, 22785, 22750, 22715, 22680, 22645,
22575, 22540, 22505, 22400, 22330, 22260, 22120, 22085, 22015,
21980, 21560])
q_data.site_id_msba.unique()
array([21560, 21980, 22015, 22085, 22120, 22260, 22330, 22400, 22505,
22540, 22575, 22645, 22680, 22715, 22750, 22785, 22820, 22855,
22890, 22925, 23135, 23240, 23345, 23380, 23415, 23450, 23485,
23555, 23660, 23730, 23765, 23835, 23905, 24150, 24220, 24255,
24535])
The data sets do not cover the same number of stores. The qualitative data is missing information for site_id_msba 23065. We will remove this observation from the data when we join the data sets.
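The mismatch can be confirmed with a set difference on the site keys; a toy sketch with three illustrative site IDs shows that an inner merge then drops the unmatched store automatically:

```python
import pandas as pd

# Toy frames: three time-series sites but only two qualitative sites
time_toy = pd.DataFrame({"site_id_msba": [24535, 23065, 22890],
                         "diesel": [722.8, 650.0, 580.3]})
qual_toy = pd.DataFrame({"site_id_msba": [24535, 22890],
                         "square_feet": [5046, 5046]})

# Which site keys appear in the time series but not the qualitative data?
missing = set(time_toy.site_id_msba) - set(qual_toy.site_id_msba)
print(missing)  # {23065}

# An inner join keeps only keys present in both frames,
# so the unmatched site is removed without an explicit filter
merged = pd.merge(qual_toy, time_toy, on="site_id_msba", how="inner")
```

The same how='inner' behavior is what removes site 23065 in the actual merge below.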
data = pd.merge(q_data, time_data, on = 'site_id_msba', how = 'inner')
data = data.drop('Unnamed: 0', axis = 1) # drop unneeded column
data.head()
| open_year | square_feet | front_door_count | years_since_last_project | parking_spaces | lottery | freal | bonfire_grill | pizza | cinnabon | ... | capital_projects.soft_opening_date | calendar.calendar_day_date | calendar.fiscal_week_id_for_year | calendar.day_of_week | calendar_information.holiday | calendar_information.type_of_day | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel_y | unleaded | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021 | 5046 | 2 | 2 | 38 | Yes | Yes | Yes | No | No | ... | 1/12/2021 | 2021-01-17 | 3 | Sunday | NONE | WEEKEND | 1792.1260 | 528.8185 | 870.0160 | 1490.3980 |
| 1 | 2021 | 5046 | 2 | 2 | 38 | Yes | Yes | Yes | No | No | ... | 1/12/2021 | 2021-01-20 | 3 | Wednesday | NONE | WEEKDAY | 2297.4595 | 793.7300 | 1582.0420 | 1750.7525 |
| 2 | 2021 | 5046 | 2 | 2 | 38 | Yes | Yes | Yes | No | No | ... | 1/12/2021 | 2021-01-21 | 3 | Thursday | NONE | WEEKDAY | 2381.5120 | 829.0730 | 1506.6660 | 1727.6210 |
| 3 | 2021 | 5046 | 2 | 2 | 38 | Yes | Yes | Yes | No | No | ... | 1/12/2021 | 2021-01-24 | 4 | Sunday | NONE | WEEKEND | 1396.4370 | 456.1830 | 510.9370 | 1184.8060 |
| 4 | 2021 | 5046 | 2 | 2 | 38 | Yes | Yes | Yes | No | No | ... | 1/12/2021 | 2021-01-29 | 5 | Friday | NONE | WEEKDAY | 2568.2370 | 782.6245 | 1604.0325 | 1987.2720 |
5 rows × 64 columns
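Note the diesel_y column in the merged output above: both source frames contain a diesel column (a Yes/No flag in the qualitative data, gallons sold in the time series), so pandas disambiguates the overlap with _x/_y suffixes. A minimal sketch of that behavior with toy one-row frames:

```python
import pandas as pd

# Toy frames sharing both the key and a non-key column name
flag_df = pd.DataFrame({"site_id_msba": [21560], "diesel": ["Yes"]})     # qualitative flag
gallons_df = pd.DataFrame({"site_id_msba": [21560], "diesel": [722.8]})  # daily gallons

merged = pd.merge(flag_df, gallons_df, on="site_id_msba", how="inner")
# Overlapping non-key columns get _x (left) and _y (right) suffixes
print(list(merged.columns))  # ['site_id_msba', 'diesel_x', 'diesel_y']
```

So in our merged data, diesel_x is the store's diesel-availability flag and diesel_y is the daily gallons target.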
How do the features correlate to each other? What is the distribution of each of the variables?
# Function to output variance and number of unique responses for each variable.
def pd_df_stats(df):
df_num = df.select_dtypes(exclude = 'object')
df_cat = df.select_dtypes(include = 'object')
for col in df_num.columns: # Variance only calculated for numeric columns
v = df[col].var()
v_vals = df[col].unique().size
print(col + " Variance: " + str(v))
print(col + " Number of Unique Reponses: " + str(v_vals))
print('\n')
for col in df_cat.columns:
v_vals = df[col].unique().size
print(col + " Number of Unique Reponses: " + str(v_vals))
print('\n')
# DF numeric columns
df_num = data.select_dtypes(exclude = 'object')
# DF categorical columns
df_cat = data.select_dtypes(include = 'object')
# Create df of columns we don't want to calculate stats for at the moment.
col_ignore = ['site_id_msba', 'capital_projects.soft_opening_date', 'calendar.calendar_day_date', 'calendar.day_of_week',
'calendar_information.holiday', 'calendar_information.type_of_day', 'calendar.fiscal_week_id_for_year']
Let's review how many categorical variables we have in the data set.
df_cat.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 13542 entries, 0 to 13541
Data columns (total 31 columns):
 #   Column                              Non-Null Count  Dtype
---  ------                              --------------  -----
 0   lottery                             13542 non-null  object
 1   freal                               13542 non-null  object
 2   bonfire_grill                       13542 non-null  object
 3   pizza                               13542 non-null  object
 4   cinnabon                            13542 non-null  object
 5   godfather_s_pizza                   13542 non-null  object
 6   ethanol_free                        13542 non-null  object
 7   diesel_x                            13542 non-null  object
 8   hi_flow_lanes                       13542 non-null  object
 9   rv_lanes                            13542 non-null  object
 10  hi_flow_rv_lanes                    13542 non-null  object
 11  def                                 13542 non-null  object
 12  cat_scales                          13542 non-null  object
 13  car_wash                            13542 non-null  object
 14  ev_charging                         13542 non-null  object
 15  rv_dumps                            13542 non-null  object
 16  propane                             13542 non-null  object
 17  traditional_forecourt_layout        13542 non-null  object
 18  traditional_forecourt_stack_type    13542 non-null  object
 19  rv_lanes_layout                     13542 non-null  object
 20  rv_lanes_stack_type                 13542 non-null  object
 21  hi_flow_lanes_layout                13542 non-null  object
 22  hi_flow_lanes_stack_type            13542 non-null  object
 23  hi_flow_rv_lanes_layout             13542 non-null  object
 24  hi_flow_rv_lanes_stack_type         13542 non-null  object
 25  non_24_hour                         13542 non-null  object
 26  self_check_out                      13542 non-null  object
 27  capital_projects.soft_opening_date  13542 non-null  object
 28  calendar.day_of_week                13542 non-null  object
 29  calendar_information.holiday        13542 non-null  object
 30  calendar_information.type_of_day    13542 non-null  object
dtypes: object(31)
memory usage: 3.3+ MB
We have 31 categorical variables contained in the data.
Below is a numeric value for the number of unique responses in each variable and the variance for any numeric variables.
pd_df_stats(data.drop(col_ignore, axis = 1))
open_year Variance: 0.21915424027528005
open_year Number of Unique Reponses: 2
square_feet Variance: 322755.8146969064
square_feet Number of Unique Reponses: 9
front_door_count Variance: 0.0
front_door_count Number of Unique Reponses: 1
years_since_last_project Variance: 0.22792040988629125
years_since_last_project Number of Unique Reponses: 2
parking_spaces Variance: 34.081406419343054
parking_spaces Number of Unique Reponses: 21
x1_mile_pop Variance: 31547830.955149226
x1_mile_pop Number of Unique Reponses: 37
x1_mile_emp Variance: 21468667.45929181
x1_mile_emp Number of Unique Reponses: 37
x1_mile_income Variance: 576136159.3301225
x1_mile_income Number of Unique Reponses: 36
x1_2_mile_pop Variance: 3568897.8555298517
x1_2_mile_pop Number of Unique Reponses: 35
x1_2_mile_emp Variance: 6030180.006107575
x1_2_mile_emp Number of Unique Reponses: 37
x1_2_mile_income Variance: 767875765.8956003
x1_2_mile_income Number of Unique Reponses: 33
x5_min_pop Variance: 175227453.49361798
x5_min_pop Number of Unique Reponses: 37
x5_min_emp Variance: 72927528.0679498
x5_min_emp Number of Unique Reponses: 37
x5_min_inc Variance: 577991590.2248507
x5_min_inc Number of Unique Reponses: 36
x7_min_pop Variance: 927558798.2259207
x7_min_pop Number of Unique Reponses: 37
x7_min_emp Variance: 408010140.3701671
x7_min_emp Number of Unique Reponses: 37
x7_min_inc Variance: 350068170.3631853
x7_min_inc Number of Unique Reponses: 37
traditional_forecourt_fueling_positions Variance: 15.17131754012339
traditional_forecourt_fueling_positions Number of Unique Reponses: 6
rv_lanes_fueling_positions Variance: 4.087957095268225
rv_lanes_fueling_positions Number of Unique Reponses: 5
hi_flow_lanes_fueling_positions Variance: 8.327861130460644
hi_flow_lanes_fueling_positions Number of Unique Reponses: 6
hi_flow_lanes_fueling_positions_2 Variance: 8.327861130460644
hi_flow_lanes_fueling_positions_2 Number of Unique Reponses: 6
rv_lanes_fueling_positions_2 Variance: 4.087957095268225
rv_lanes_fueling_positions_2 Number of Unique Reponses: 5
mens_toilet_count Variance: 0.8298640565090605
mens_toilet_count Number of Unique Reponses: 6
mens_urinal_count Variance: 0.714442823297413
mens_urinal_count Number of Unique Reponses: 5
womens_toilet_count Variance: 2.9848807525493144
womens_toilet_count Number of Unique Reponses: 6
womens_sink_count Variance: 0.5332753180031814
womens_sink_count Number of Unique Reponses: 5
daily_yoy_ndt.total_inside_sales Variance: 976919.8145904752
daily_yoy_ndt.total_inside_sales Number of Unique Reponses: 13433
daily_yoy_ndt.total_food_service Variance: 117578.08104941822
daily_yoy_ndt.total_food_service Number of Unique Reponses: 13182
diesel_y Variance: 4747015.830054692
diesel_y Number of Unique Reponses: 13422
unleaded Variance: 1073535.4208178206
unleaded Number of Unique Reponses: 13453
lottery Number of Unique Reponses: 2
freal Number of Unique Reponses: 2
bonfire_grill Number of Unique Reponses: 2
pizza Number of Unique Reponses: 2
cinnabon Number of Unique Reponses: 2
godfather_s_pizza Number of Unique Reponses: 1
ethanol_free Number of Unique Reponses: 2
diesel_x Number of Unique Reponses: 1
hi_flow_lanes Number of Unique Reponses: 2
rv_lanes Number of Unique Reponses: 2
hi_flow_rv_lanes Number of Unique Reponses: 2
def Number of Unique Reponses: 2
cat_scales Number of Unique Reponses: 2
car_wash Number of Unique Reponses: 1
ev_charging Number of Unique Reponses: 1
rv_dumps Number of Unique Reponses: 2
propane Number of Unique Reponses: 2
traditional_forecourt_layout Number of Unique Reponses: 2
traditional_forecourt_stack_type Number of Unique Reponses: 3
rv_lanes_layout Number of Unique Reponses: 3
rv_lanes_stack_type Number of Unique Reponses: 2
hi_flow_lanes_layout Number of Unique Reponses: 3
hi_flow_lanes_stack_type Number of Unique Reponses: 2
hi_flow_rv_lanes_layout Number of Unique Reponses: 4
hi_flow_rv_lanes_stack_type Number of Unique Reponses: 2
non_24_hour Number of Unique Reponses: 1
self_check_out Number of Unique Reponses: 1
The results show multiple consistencies across the stores. All stores are open 24 hours (non_24_hour, a double negative to account for, takes only a single value), all Maverik stores have self-checkout, no Maverik store has electric vehicle charging stations, no Maverik store has a Godfather's Pizza, all sell diesel, and no store has a car wash.
# Importing the necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Creating the dataframes with raw string literals
time = pd.read_csv(r'C:\Users\jusha\Documents\IS 6813 MSBA Final Capstone Maverik Project Folder\time_series_data_msba.csv')
qual = pd.read_csv(r'C:\Users\jusha\Documents\IS 6813 MSBA Final Capstone Maverik Project Folder\qualitative_data_msba.csv')
# List of categorical columns you want to examine
categorical_columns = [
'lottery', 'freal', 'bonfire_grill', 'pizza', 'cinnabon', 'godfather_s_pizza',
'ethanol_free', 'diesel', 'hi_flow_lanes', 'rv_lanes', 'hi_flow_rv_lanes', 'def',
'cat_scales', 'car_wash', 'ev_charging', 'rv_dumps', 'propane',
'traditional_forecourt_layout', 'traditional_forecourt_stack_type',
'rv_lanes_layout', 'rv_lanes_stack_type',
'hi_flow_lanes_layout', 'hi_flow_lanes_stack_type',
'hi_flow_rv_lanes_layout', 'hi_flow_rv_lanes_stack_type',
'non_24_hour', 'self_check_out'
]
# Loop through the categorical columns and print unique categories
for col in categorical_columns:
unique_categories = qual[col].unique()
print(f'Unique categories in column {col}:')
for category in unique_categories:
print(category)
print('\n')
Unique categories in column lottery: Yes No
Unique categories in column freal: Yes No
Unique categories in column bonfire_grill: Yes No
Unique categories in column pizza: No Yes
Unique categories in column cinnabon: No Yes
Unique categories in column godfather_s_pizza: No
Unique categories in column ethanol_free: Yes No
Unique categories in column diesel: Yes
Unique categories in column hi_flow_lanes: Yes No
Unique categories in column rv_lanes: Yes No
Unique categories in column hi_flow_rv_lanes: Yes No
Unique categories in column def: Yes No
Unique categories in column cat_scales: No Yes
Unique categories in column car_wash: No
Unique categories in column ev_charging: No
Unique categories in column rv_dumps: Yes No
Unique categories in column propane: Yes No
Unique categories in column traditional_forecourt_layout: Stack In-Line
Unique categories in column traditional_forecourt_stack_type: Large None Extra-Large
Unique categories in column rv_lanes_layout: Stack In-Line nan
Unique categories in column rv_lanes_stack_type: HF/RV None nan
Unique categories in column hi_flow_lanes_layout: Stack Combo nan
Unique categories in column hi_flow_lanes_stack_type: HF/RV nan
Unique categories in column hi_flow_rv_lanes_layout: Stack Combo In-Line nan
Unique categories in column hi_flow_rv_lanes_stack_type: HF/RV None nan
Unique categories in column non_24_hour: No
Unique categories in column self_check_out: Yes
From this output we can get a general idea of the categories used for each store feature and see some consistency across locations. All stores are open 24 hours ("No" is the only category for non_24_hour, a double negative to account for), all Maverik stores have self checkout, no stores have electric vehicle charging stations, none have a Godfather's Pizza, all sell diesel, and no stores have car washes available.
categorical_columns = [
'lottery', 'freal', 'bonfire_grill', 'pizza', 'cinnabon', 'godfather_s_pizza',
'ethanol_free', 'diesel', 'hi_flow_lanes', 'rv_lanes', 'hi_flow_rv_lanes', 'def',
'cat_scales', 'car_wash', 'ev_charging', 'rv_dumps', 'propane',
'traditional_forecourt_layout', 'traditional_forecourt_stack_type',
'rv_lanes_layout', 'rv_lanes_stack_type',
'hi_flow_lanes_layout', 'hi_flow_lanes_stack_type',
'hi_flow_rv_lanes_layout', 'hi_flow_rv_lanes_stack_type',
'non_24_hour', 'self_check_out'
]
# Create subplots to display bar plots for each categorical variable
num_plots = len(categorical_columns)
num_cols = 3  # Number of columns in the subplot grid
num_rows = (num_plots + num_cols - 1) // num_cols  # Calculate the number of rows

fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 3 * num_rows))
fig.subplots_adjust(wspace=0.5, hspace=0.5)  # Adjust spacing between subplots

for i, col in enumerate(categorical_columns):
    row_idx = i // num_cols
    col_idx = i % num_cols
    ax = axes[row_idx, col_idx]
    # Count the occurrences of each category in the column
    category_counts = q_data[col].value_counts()
    # Create a bar plot for the current categorical variable
    category_counts.plot(kind='bar', color='skyblue', ax=ax)
    ax.set_title(f'Distribution of {col}')
    ax.set_xlabel('Categories')
    ax.set_ylabel('Count')
    ax.tick_params(axis='x', rotation=45)  # Rotate x-axis labels for readability

# Ensure all subplots are properly displayed
plt.tight_layout()
# Show the plots
plt.show()
These distribution charts show which features are standard across Maverik stores and how product and service offerings vary by location. Differences in offerings could significantly impact the four target metrics.
We can gain additional insight into the variables by looking at their individual distributions.
import matplotlib.pyplot as plt

for i, col in enumerate(data.drop(col_ignore, axis=1).columns):
    plt.figure(i)
    sns.histplot(data[col], stat='density', kde=True, kde_kws={"cut": 3}).set(title=f'{col} Density Plot')
    plt.show()
    plt.close()  # Close each figure so matplotlib does not warn about too many open figures
From the above numeric outputs and plots we can see that there are multiple variables that contain only one response: 'self_check_out', 'non_24_hour', 'ev_charging', 'car_wash', 'diesel_x', 'godfather_s_pizza', and 'front_door_count'. These add no additional value to our analysis, so it is best to remove them.
data = data.drop(['self_check_out', 'non_24_hour', 'ev_charging', 'car_wash', 'diesel_x', 'godfather_s_pizza', 'front_door_count'], axis = 1)
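Rather than listing the single-valued columns by hand, they can also be found programmatically. A minimal sketch on a small hypothetical frame (the column names here are just illustrative; `nunique` counts the distinct values in each column):

```python
import pandas as pd

# Hypothetical frame standing in for the Maverik qualitative data
df = pd.DataFrame({
    'self_check_out': ['Yes', 'Yes', 'Yes'],
    'car_wash': ['No', 'No', 'No'],
    'lottery': ['Yes', 'No', 'Yes'],
})

# Columns whose single unique value adds no information to the model
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]
df = df.drop(columns=constant_cols)

print(constant_cols)      # ['self_check_out', 'car_wash']
print(list(df.columns))   # ['lottery']
```

This avoids missing a constant column if the data is refreshed later.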
We wanted to see if we could find additional data to use for our model development. We pulled stock market data for crude oil prices, regular and diesel fuel prices, and average monthly state temperatures for all the states where Maverik currently has locations. We wanted to answer a couple of questions, including:
oil.head()
| | Date | Crude Oil Cushing, OK WTI Spot Price FOB (Dollars per Barrel) | New York Harbor Conventional Gasoline Regular Spot Price FOB (Dollars per Gallon) | U.S. Gulf Coast Conventional Gasoline Regular Spot Price FOB (Dollars per Gallon) | Los Angeles Reformulated RBOB Regular Gasoline Spot Price (Dollars per Gallon) | New York Harbor Ultra-Low Sulfur No 2 Diesel Spot Price (Dollars per Gallon) | U.S. Gulf Coast Ultra-Low Sulfur No 2 Diesel Spot Price (Dollars per Gallon) | Los Angeles, CA Ultra-Low Sulfur CARB Diesel Spot Price (Dollars per Gallon) |
|---|---|---|---|---|---|---|---|---|
| 0 | 1/4/2021 | 47.47 | 1.402 | 1.337 | 1.534 | 1.459 | 1.397 | 1.484 |
| 1 | 1/5/2021 | 49.78 | 1.485 | 1.423 | 1.608 | 1.521 | 1.471 | 1.546 |
| 2 | 1/6/2021 | 50.45 | 1.503 | 1.436 | 1.621 | 1.524 | 1.461 | 1.526 |
| 3 | 1/7/2021 | 50.63 | 1.512 | 1.455 | 1.625 | 1.537 | 1.491 | 1.525 |
| 4 | 1/8/2021 | 52.14 | 1.576 | 1.519 | 1.659 | 1.582 | 1.535 | 1.569 |
temp_data.head()
| | Date | Utah | Idaho | Oregon | South Dakota | Nebraska | New Mexico | Washington | Arizona | Colorado | Nevada | Wyoming | W. Monthly Average Temp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1/12/2021 | 28.1 | 26.6 | 34.8 | 26.5 | 29.3 | 35.1 | 34.6 | 42.6 | 26.7 | 33.7 | 23.4 | 29.1 |
| 1 | 2/1/2021 | 32.2 | 25.4 | 33.6 | 12.8 | 16.5 | 37.3 | 31.4 | 46.5 | 24.8 | 35.6 | 17.4 | 30.1 |
| 2 | 3/1/2021 | 38.4 | 34.8 | 39.0 | 39.4 | 42.6 | 44.8 | 39.7 | 49.4 | 35.8 | 39.2 | 32.2 | 38.0 |
| 3 | 4/1/2021 | 47.4 | 42.0 | 46.6 | 43.6 | 47.0 | 53.2 | 46.8 | 61.2 | 43.1 | 49.3 | 38.0 | 46.4 |
| 4 | 5/1/2021 | 57.1 | 49.8 | 52.0 | 55.0 | 57.9 | 62.7 | 52.6 | 67.9 | 52.9 | 57.2 | 48.8 | 55.4 |
How do we identify the week of the year and the day of the week from a date value? How do we identify a US holiday from a date value?
from datetime import date
import holidays

def get_holiday(date_value):
    # Return the holiday name for a date, or None if it is not a US holiday
    US_holidays = holidays.US()
    return US_holidays.get(date_value)
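The week-of-year and weekday questions can be answered directly from the standard library with `date.isocalendar()`. A minimal sketch (using January 4, 2021, a Monday, as the sample date):

```python
from datetime import date

d = date(2021, 1, 4)
# isocalendar() returns (ISO year, ISO week 1-53, ISO weekday Mon=1..Sun=7)
iso_year, iso_week, iso_weekday = d.isocalendar()
print(iso_week, iso_weekday)  # 1 1
```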
How many holidays are being flagged in the data from Maverik?
print(data['calendar_information.holiday'].unique().size)
data['calendar_information.holiday'].unique()
26
array(['NONE', "Lincoln's Birthday", "Washington's Birthday", 'Easter',
'Memorial Day', "Father's Day", "Veteran's Day",
"Saint Valentine's Day", "Mother's Day", 'Independence Day',
'Christmas Day', "New Year's Eve", 'Martin Luther King Day',
"President's Day", "Saint Patrick's Day", 'Labor Day',
'Columbus Day', 'Thanksgiving Day', 'Christmas Eve', 'Palm Sunday',
'Good Friday', 'Ascension', 'Flag Day', 'Halloween Day',
"All Saint's Day", "New Year's Day"], dtype=object)
There are 26 holidays being flagged by Maverik. Some of these are Catholic holidays, for example 'Good Friday' and 'Palm Sunday'. It seems odd to track these, as they are not federal holidays when people are off work or traveling.
For simplicity, we will instead track only federal holidays, using the `holidays` package (`holidays.US()`) to identify them moving forward. To do this, we will drop the current holiday data and add a new holidays variable built from that package.
data = data.drop('calendar_information.holiday', axis = 1)
us_holidays = holidays.US()  # Build the lookup once instead of per row
data['Holidays'] = pd.Series(data['calendar.calendar_day_date']).apply(lambda x: us_holidays.get(x)).values
data.Holidays.unique() # We are only including US Federal Holidays.
array([None, 'Memorial Day', 'Juneteenth National Independence Day',
'Veterans Day', 'Juneteenth National Independence Day (Observed)',
'Independence Day', 'Christmas Day', "New Year's Day (Observed)",
'Martin Luther King Jr. Day', "Washington's Birthday",
'Independence Day (Observed)', 'Labor Day', 'Columbus Day',
'Thanksgiving', 'Christmas Day (Observed)', "New Year's Day"],
dtype=object)
percentage_na_values_table(data).head(10)
DF has 57 columns. There are 1 columns that have missing values.
| | Total NA's in Column | Percentage NA |
|---|---|---|
| Holidays | 13017 | 96.1 |
data.fillna('Not Holiday', inplace = True)
percentage_na_values_table(data).head(10)
DF has 57 columns. There are 0 columns that have missing values.
| Total NA's in Column | Percentage NA |
|---|---|
These next few user-defined functions will be used both in the cleaning process and building out our predictions.
from datetime import datetime
import calendar
import holidays

def get_holidays(date_value):
    # True if the date falls on a US holiday
    return date_value in holidays.US()

def get_day_of_week(date_value):
    # Name of the weekday for a 'm/d/Y' date string
    d = datetime.strptime(date_value, '%m/%d/%Y').weekday()
    return calendar.day_name[d]

def get_type_of_day(date_value):
    # 'WEEKDAY' for Monday-Friday, 'WEEKEND' otherwise
    d = datetime.strptime(date_value, '%m/%d/%Y').weekday()
    if d < 5:
        return 'WEEKDAY'
    else:
        return 'WEEKEND'

def get_week_num(date_value):
    # ISO week number as a zero-padded string
    return datetime.strptime(date_value, '%m/%d/%Y').strftime('%V')
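As a quick sanity check of the date helpers, here is a self-contained version of two of them applied to a sample date (Monday, January 4, 2021):

```python
from datetime import datetime
import calendar

def get_day_of_week(date_value):
    # Name of the weekday for a 'm/d/Y' date string
    return calendar.day_name[datetime.strptime(date_value, '%m/%d/%Y').weekday()]

def get_week_num(date_value):
    # ISO week number as a zero-padded string
    return datetime.strptime(date_value, '%m/%d/%Y').strftime('%V')

print(get_day_of_week('1/4/2021'))  # Monday
print(get_week_num('1/4/2021'))     # 01
```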
We can now work on joining the oil and temperature data to Maverik's data.
oil['Date'] = pd.to_datetime(oil['Date'], errors='coerce')
temp_data['Date'] = pd.to_datetime(temp_data['Date'], errors='coerce')
data = pd.merge(data, oil, left_on='calendar.calendar_day_date', right_on='Date', how='left')
data = data.drop('Date', axis = 1)
percentage_na_values_table(data).head(10)
DF has 64 columns. There are 7 columns that have missing values.
| | Total NA's in Column | Percentage NA |
|---|---|---|
| New York Harbor Conventional Gasoline Regular Spot Price FOB (Dollars per Gallon) | 4237 | 31.3 |
| U.S. Gulf Coast Conventional Gasoline Regular Spot Price FOB (Dollars per Gallon) | 4237 | 31.3 |
| Los Angeles Reformulated RBOB Regular Gasoline Spot Price (Dollars per Gallon) | 4237 | 31.3 |
| New York Harbor Ultra-Low Sulfur No 2 Diesel Spot Price (Dollars per Gallon) | 4237 | 31.3 |
| U.S. Gulf Coast Ultra-Low Sulfur No 2 Diesel Spot Price (Dollars per Gallon) | 4237 | 31.3 |
| Los Angeles, CA Ultra-Low Sulfur CARB Diesel Spot Price (Dollars per Gallon) | 4237 | 31.3 |
| Crude Oil Cushing, OK WTI Spot Price FOB (Dollars per Barrel) | 3924 | 29.0 |
When we join the oil data to the Maverik data, we get missing values for days when the stock market is closed: weekends and US holidays. We will use "last observation carried forward" to impute the missing data; if needed, we will then apply "next observation carried back" to fill any remaining gaps. This means that Friday's prices carry over to Saturday and Sunday.
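As a minimal illustration of this forward-fill/back-fill strategy, consider a toy series of hypothetical prices where Friday 1/8/2021 is observed and the weekend is missing:

```python
import pandas as pd

# Hypothetical mini-series: Friday has a price, Saturday/Sunday are missing
prices = pd.Series(
    [52.14, None, None, 53.00],
    index=pd.to_datetime(['2021-01-08', '2021-01-09', '2021-01-10', '2021-01-11']),
)

# Carry the last observation forward, then fill any leading gaps backward
filled = prices.ffill().bfill()
print(filled['2021-01-09'])  # 52.14 (Friday's price carried to Saturday)
```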
hol = data['Holidays']
data = data.drop('Holidays', axis = 1)
data = data.fillna(method = 'ffill')
data = data.fillna(method = 'bfill')
data['Holidays'] = hol
percentage_na_values_table(data).head(10)
DF has 64 columns. There are 0 columns that have missing values.
| Total NA's in Column | Percentage NA |
|---|---|
data = pd.merge(data, temp_data, left_on = 'calendar.calendar_day_date', right_on = 'Date', how = 'left')
data = data.drop('Date', axis = 1)
data.head()
| | open_year | square_feet | years_since_last_project | parking_spaces | lottery | freal | bonfire_grill | pizza | cinnabon | ethanol_free | ... | Oregon | South Dakota | Nebraska | New Mexico | Washington | Arizona | Colorado | Nevada | Wyoming | W. Monthly Average Temp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021 | 5046 | 2 | 38 | Yes | Yes | Yes | No | No | Yes | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2021 | 5046 | 2 | 38 | Yes | Yes | Yes | No | No | Yes | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2021 | 5046 | 2 | 38 | Yes | Yes | Yes | No | No | Yes | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2021 | 5046 | 2 | 38 | Yes | Yes | Yes | No | No | Yes | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2021 | 5046 | 2 | 38 | Yes | Yes | Yes | No | No | Yes | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 76 columns
Following the same method we used on the missing oil data, we will use "last observation carried forward" and if needed "next observation carried back" to impute missing data.
data = data.drop('Holidays', axis = 1)
data = data.fillna(method = 'ffill')
data = data.fillna(method = 'bfill')
data['Holidays'] = hol
percentage_na_values_table(data).head(10)
DF has 76 columns. There are 0 columns that have missing values.
| Total NA's in Column | Percentage NA |
|---|---|
data.head()
| | open_year | square_feet | years_since_last_project | parking_spaces | lottery | freal | bonfire_grill | pizza | cinnabon | ethanol_free | ... | South Dakota | Nebraska | New Mexico | Washington | Arizona | Colorado | Nevada | Wyoming | W. Monthly Average Temp | Holidays |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021 | 5046 | 2 | 38 | Yes | Yes | Yes | No | No | Yes | ... | 39.4 | 42.6 | 44.8 | 39.7 | 49.4 | 35.8 | 39.2 | 32.2 | 38.0 | Not Holiday |
| 1 | 2021 | 5046 | 2 | 38 | Yes | Yes | Yes | No | No | Yes | ... | 39.4 | 42.6 | 44.8 | 39.7 | 49.4 | 35.8 | 39.2 | 32.2 | 38.0 | Not Holiday |
| 2 | 2021 | 5046 | 2 | 38 | Yes | Yes | Yes | No | No | Yes | ... | 39.4 | 42.6 | 44.8 | 39.7 | 49.4 | 35.8 | 39.2 | 32.2 | 38.0 | Not Holiday |
| 3 | 2021 | 5046 | 2 | 38 | Yes | Yes | Yes | No | No | Yes | ... | 39.4 | 42.6 | 44.8 | 39.7 | 49.4 | 35.8 | 39.2 | 32.2 | 38.0 | Not Holiday |
| 4 | 2021 | 5046 | 2 | 38 | Yes | Yes | Yes | No | No | Yes | ... | 39.4 | 42.6 | 44.8 | 39.7 | 49.4 | 35.8 | 39.2 | 32.2 | 38.0 | Not Holiday |
5 rows × 76 columns
We wanted to examine the correlations between variables to see whether any pairs are strongly related. Looking through the plot, we did not find any correlations out of the ordinary.
data.corr(numeric_only=True).style.background_gradient(cmap='coolwarm')
| open_year | square_feet | years_since_last_project | parking_spaces | x1_mile_pop | x1_mile_emp | x1_mile_income | x1_2_mile_pop | x1_2_mile_emp | x1_2_mile_income | x5_min_pop | x5_min_emp | x5_min_inc | x7_min_pop | x7_min_emp | x7_min_inc | traditional_forecourt_fueling_positions | rv_lanes_fueling_positions | hi_flow_lanes_fueling_positions | hi_flow_lanes_fueling_positions_2 | rv_lanes_fueling_positions_2 | mens_toilet_count | mens_urinal_count | womens_toilet_count | womens_sink_count | site_id_msba | calendar.fiscal_week_id_for_year | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel_y | unleaded | Crude Oil Cushing, OK WTI Spot Price FOB (Dollars per Barrel) | New York Harbor Conventional Gasoline Regular Spot Price FOB (Dollars per Gallon) | U.S. Gulf Coast Conventional Gasoline Regular Spot Price FOB (Dollars per Gallon) | Los Angeles Reformulated RBOB Regular Gasoline Spot Price (Dollars per Gallon) | New York Harbor Ultra-Low Sulfur No 2 Diesel Spot Price (Dollars per Gallon) | U.S. Gulf Coast Ultra-Low Sulfur No 2 Diesel Spot Price (Dollars per Gallon) | Los Angeles, CA Ultra-Low Sulfur CARB Diesel Spot Price (Dollars per Gallon) | Utah | Idaho | Oregon | South Dakota | Nebraska | New Mexico | Washington | Arizona | Colorado | Nevada | Wyoming | W. Monthly Average Temp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| open_year | 1.000000 | -0.192409 | -0.941357 | -0.028334 | -0.230505 | -0.235286 | 0.170861 | -0.220853 | -0.161754 | 0.072294 | -0.109812 | -0.180327 | 0.146184 | -0.094008 | -0.141329 | 0.243834 | 0.159450 | -0.061743 | 0.002163 | 0.002163 | -0.061743 | -0.224399 | -0.219693 | -0.326963 | 0.123938 | 0.671855 | -0.000880 | -0.069297 | 0.026521 | -0.045712 | -0.221243 | -0.015587 | 0.111766 | 0.055195 | 0.116480 | 0.219266 | 0.215758 | 0.201209 | -0.055653 | -0.030094 | -0.047792 | -0.020774 | -0.030016 | -0.021913 | 0.008629 | -0.067927 | -0.051333 | -0.073312 | -0.052838 | -0.048845 |
| square_feet | -0.192409 | 1.000000 | 0.181125 | 0.512273 | 0.081303 | 0.067731 | 0.124278 | 0.038785 | 0.065212 | 0.013999 | 0.031238 | 0.013214 | 0.118845 | 0.044482 | -0.043739 | -0.047139 | 0.101104 | 0.160360 | 0.079428 | 0.079428 | 0.160360 | 0.238842 | 0.253151 | -0.036418 | -0.252981 | -0.203161 | 0.000121 | -0.196982 | -0.108011 | 0.075498 | -0.069720 | 0.004553 | -0.022697 | -0.012559 | -0.021243 | -0.043348 | -0.041276 | -0.038229 | 0.035205 | 0.024390 | 0.027531 | 0.018392 | 0.025206 | 0.034425 | 0.016107 | 0.043985 | 0.033964 | 0.037200 | 0.029308 | 0.032454 |
| years_since_last_project | -0.941357 | 0.181125 | 1.000000 | 0.022017 | 0.261263 | 0.257998 | -0.109185 | 0.247346 | 0.162166 | -0.029414 | 0.100381 | 0.164958 | -0.096915 | 0.106509 | 0.151441 | -0.181812 | -0.239638 | 0.018920 | -0.034996 | -0.034996 | 0.018920 | 0.181408 | 0.171980 | 0.276330 | -0.144581 | -0.608462 | 0.000534 | 0.088361 | -0.018858 | 0.050270 | 0.201761 | -0.005757 | -0.133397 | -0.078795 | -0.141029 | -0.236430 | -0.236312 | -0.224180 | 0.056122 | 0.032672 | 0.045896 | 0.024845 | 0.033402 | 0.025374 | -0.007749 | 0.066775 | 0.053654 | 0.071321 | 0.055134 | 0.049899 |
| parking_spaces | -0.028334 | 0.512273 | 0.022017 | 1.000000 | 0.018800 | 0.149524 | 0.163143 | 0.043946 | 0.240364 | 0.004789 | -0.016454 | -0.038980 | 0.132765 | -0.053546 | -0.057206 | 0.026533 | -0.038102 | 0.133496 | 0.078831 | 0.078831 | 0.133496 | 0.281180 | 0.338129 | 0.008764 | 0.167756 | 0.059525 | 0.000298 | -0.227230 | -0.152398 | 0.144365 | -0.073563 | -0.057014 | -0.074864 | -0.072173 | -0.073249 | -0.091553 | -0.092602 | -0.094888 | 0.039701 | 0.045855 | 0.052789 | 0.042652 | 0.036869 | 0.036282 | 0.051183 | 0.039142 | 0.037052 | 0.043851 | 0.042974 | 0.041602 |
| x1_mile_pop | -0.230505 | 0.081303 | 0.261263 | 0.018800 | 1.000000 | 0.535985 | 0.249023 | 0.914542 | 0.339833 | 0.275087 | 0.821745 | 0.448041 | 0.398488 | 0.714997 | 0.320397 | 0.324207 | -0.180283 | -0.622701 | -0.627867 | -0.627867 | -0.622701 | -0.325018 | -0.291902 | -0.251707 | -0.584866 | -0.106862 | -0.000018 | -0.146207 | -0.364787 | -0.412535 | -0.128380 | 0.058864 | 0.036827 | 0.056064 | 0.016890 | -0.008804 | -0.005068 | 0.000400 | 0.007527 | -0.000648 | 0.005157 | -0.003295 | -0.000368 | -0.002960 | -0.014709 | 0.010503 | 0.007582 | 0.016895 | 0.007053 | 0.005740 |
| x1_mile_emp | -0.235286 | 0.067731 | 0.257998 | 0.149524 | 0.535985 | 1.000000 | -0.246582 | 0.469044 | 0.887929 | -0.179146 | 0.380305 | 0.748979 | -0.155504 | 0.378551 | 0.636270 | -0.189042 | -0.166185 | -0.190178 | -0.142028 | -0.142028 | -0.190178 | -0.028772 | -0.035912 | -0.043336 | -0.153413 | -0.217948 | -0.000348 | -0.092304 | -0.194294 | -0.059426 | -0.057403 | -0.006478 | -0.042594 | -0.022153 | -0.052771 | -0.085423 | -0.083834 | -0.082242 | 0.000305 | -0.005055 | -0.002466 | -0.003297 | -0.001969 | -0.008584 | -0.018011 | 0.002267 | 0.003971 | 0.005010 | 0.004445 | -0.000492 |
| x1_mile_income | 0.170861 | 0.124278 | -0.109185 | 0.163143 | 0.249023 | -0.246582 | 1.000000 | 0.219782 | -0.188947 | 0.885602 | 0.430553 | -0.067320 | 0.916598 | 0.403490 | -0.076149 | 0.772807 | -0.039964 | -0.182524 | -0.255587 | -0.255587 | -0.182524 | -0.297923 | -0.141283 | -0.258020 | -0.322191 | 0.278455 | 0.000254 | -0.138241 | -0.224750 | -0.182090 | -0.013171 | 0.033294 | 0.056214 | 0.044057 | 0.054176 | 0.080157 | 0.078923 | 0.074755 | -0.033512 | -0.027646 | -0.023350 | -0.033782 | -0.034434 | -0.030729 | -0.019198 | -0.036744 | -0.036112 | -0.029606 | -0.033132 | -0.031946 |
| x1_2_mile_pop | -0.220853 | 0.038785 | 0.247346 | 0.043946 | 0.914542 | 0.469044 | 0.219782 | 1.000000 | 0.297205 | 0.243947 | 0.700259 | 0.398711 | 0.317704 | 0.600899 | 0.305795 | 0.266174 | -0.170069 | -0.581501 | -0.548081 | -0.548081 | -0.581501 | -0.359655 | -0.252016 | -0.275491 | -0.537279 | -0.094199 | 0.000509 | -0.121629 | -0.306799 | -0.357292 | -0.116141 | 0.083356 | 0.078264 | 0.094329 | 0.058005 | 0.036619 | 0.041878 | 0.048122 | -0.018028 | -0.026489 | -0.018848 | -0.028190 | -0.025024 | -0.026624 | -0.038876 | -0.014272 | -0.017240 | -0.008257 | -0.018491 | -0.019699 |
| x1_2_mile_emp | -0.161754 | 0.065212 | 0.162166 | 0.240364 | 0.339833 | 0.887929 | -0.188947 | 0.297205 | 1.000000 | -0.172321 | 0.200069 | 0.531181 | -0.168016 | 0.160154 | 0.378672 | -0.258327 | -0.184296 | 0.010716 | 0.053525 | 0.053525 | 0.010716 | 0.059919 | 0.087382 | 0.018092 | -0.010498 | -0.097665 | -0.000215 | -0.035234 | -0.092283 | 0.107757 | -0.000437 | -0.020377 | -0.049467 | -0.035299 | -0.050152 | -0.070121 | -0.070493 | -0.071627 | -0.010545 | -0.012762 | -0.012346 | -0.011296 | -0.010555 | -0.016387 | -0.021654 | -0.009949 | -0.007521 | -0.008001 | -0.006230 | -0.010729 |
| x1_2_mile_income | 0.072294 | 0.013999 | -0.029414 | 0.004789 | 0.275087 | -0.179146 | 0.885602 | 0.243947 | -0.172321 | 1.000000 | 0.436332 | 0.055766 | 0.778329 | 0.406171 | 0.047975 | 0.682272 | -0.047739 | -0.151180 | -0.274691 | -0.274691 | -0.151180 | -0.374747 | -0.243045 | -0.244836 | -0.431842 | 0.131216 | -0.000192 | -0.141035 | -0.231706 | -0.321937 | -0.015916 | 0.013977 | 0.015495 | 0.010410 | 0.014994 | 0.024507 | 0.023652 | 0.021122 | -0.034467 | -0.031419 | -0.026358 | -0.034813 | -0.036313 | -0.037008 | -0.027705 | -0.037357 | -0.036772 | -0.028832 | -0.034716 | -0.033716 |
| x5_min_pop | -0.109812 | 0.031238 | 0.100381 | -0.016454 | 0.821745 | 0.380305 | 0.430553 | 0.700259 | 0.200069 | 0.436332 | 1.000000 | 0.537709 | 0.523531 | 0.948629 | 0.447480 | 0.467086 | -0.117247 | -0.511173 | -0.538993 | -0.538993 | -0.511173 | -0.323835 | -0.214876 | -0.234472 | -0.561176 | -0.160090 | -0.000751 | -0.133918 | -0.338342 | -0.373938 | -0.168466 | 0.039332 | 0.013689 | 0.030691 | -0.000893 | -0.017328 | -0.015423 | -0.010920 | 0.015755 | 0.008787 | 0.013503 | 0.006831 | 0.008799 | 0.005404 | -0.003854 | 0.016897 | 0.015937 | 0.023099 | 0.015149 | 0.014076 |
| x5_min_emp | -0.180327 | 0.013214 | 0.164958 | -0.038980 | 0.448041 | 0.748979 | -0.067320 | 0.398711 | 0.531181 | 0.055766 | 0.537709 | 1.000000 | -0.032527 | 0.588993 | 0.918451 | -0.039184 | -0.025244 | -0.255966 | -0.220977 | -0.220977 | -0.255966 | -0.243266 | -0.181535 | -0.121102 | -0.345355 | -0.300217 | -0.000641 | -0.173500 | -0.251672 | -0.163803 | -0.208143 | -0.005632 | -0.038843 | -0.021156 | -0.045799 | -0.074726 | -0.072745 | -0.070289 | 0.000544 | -0.006572 | -0.003293 | -0.004501 | -0.001276 | -0.007784 | -0.019537 | 0.003551 | 0.005659 | 0.004964 | 0.003100 | -0.000620 |
| x5_min_inc | 0.146184 | 0.118845 | -0.096915 | 0.132765 | 0.398488 | -0.155504 | 0.916598 | 0.317704 | -0.168016 | 0.778329 | 0.523531 | -0.032527 | 1.000000 | 0.478581 | -0.072960 | 0.877708 | 0.005904 | -0.392845 | -0.433214 | -0.433214 | -0.392845 | -0.333490 | -0.269849 | -0.364046 | -0.385129 | 0.248457 | 0.000714 | -0.183157 | -0.319879 | -0.265728 | 0.099381 | 0.076541 | 0.112430 | 0.100207 | 0.100670 | 0.126263 | 0.129549 | 0.127420 | -0.022023 | -0.015966 | -0.012388 | -0.022670 | -0.021057 | -0.017104 | -0.010304 | -0.025185 | -0.022489 | -0.018437 | -0.019923 | -0.020090 |
| x7_min_pop | -0.094008 | 0.044482 | 0.106509 | -0.053546 | 0.714997 | 0.378551 | 0.403490 | 0.600899 | 0.160154 | 0.406171 | 0.948629 | 0.588993 | 0.478581 | 1.000000 | 0.579978 | 0.436803 | -0.073419 | -0.457346 | -0.438330 | -0.438330 | -0.457346 | -0.245009 | -0.171626 | -0.154260 | -0.471007 | -0.228221 | -0.001028 | -0.099636 | -0.293003 | -0.300413 | -0.203366 | 0.028706 | -0.001558 | 0.014971 | -0.013134 | -0.026863 | -0.025627 | -0.022897 | 0.015653 | 0.007934 | 0.012450 | 0.005452 | 0.008033 | 0.005040 | -0.005917 | 0.017064 | 0.015889 | 0.022840 | 0.014547 | 0.013711 |
| x7_min_emp | -0.141329 | -0.043739 | 0.151441 | -0.057206 | 0.320397 | 0.636270 | -0.076149 | 0.305795 | 0.378672 | 0.047975 | 0.447480 | 0.918451 | -0.072960 | 0.579978 | 1.000000 | -0.052566 | -0.031672 | -0.190962 | -0.095881 | -0.095881 | -0.190962 | -0.153198 | -0.087910 | -0.037604 | -0.190749 | -0.297454 | -0.000781 | -0.131280 | -0.198665 | -0.105679 | -0.240116 | -0.029025 | -0.065856 | -0.049700 | -0.068488 | -0.097558 | -0.097000 | -0.096472 | -0.016554 | -0.022529 | -0.019123 | -0.020469 | -0.018873 | -0.024887 | -0.033577 | -0.013186 | -0.012388 | -0.012219 | -0.014626 | -0.017574 |
| x7_min_inc | 0.243834 | -0.047139 | -0.181812 | 0.026533 | 0.324207 | -0.189042 | 0.772807 | 0.266174 | -0.258327 | 0.682272 | 0.467086 | -0.039184 | 0.877708 | 0.436803 | -0.052566 | 1.000000 | 0.082837 | -0.398963 | -0.405198 | -0.405198 | -0.398963 | -0.382187 | -0.296975 | -0.334350 | -0.292064 | 0.244569 | 0.000376 | -0.062838 | -0.167597 | -0.264862 | 0.004000 | 0.059577 | 0.110350 | 0.092064 | 0.095448 | 0.132509 | 0.133461 | 0.129742 | -0.024250 | -0.013359 | -0.009183 | -0.020458 | -0.022092 | -0.019137 | -0.000141 | -0.031042 | -0.026716 | -0.021298 | -0.022066 | -0.021285 |
| traditional_forecourt_fueling_positions | 0.159450 | 0.101104 | -0.239638 | -0.038102 | -0.180283 | -0.166185 | -0.039964 | -0.170069 | -0.184296 | -0.047739 | -0.117247 | -0.025244 | 0.005904 | -0.073419 | -0.031672 | 0.082837 | 1.000000 | -0.038217 | -0.017417 | -0.017417 | -0.038217 | 0.047352 | 0.069673 | -0.001954 | 0.142282 | -0.207402 | 0.001060 | -0.074664 | 0.048519 | 0.256609 | 0.039938 | 0.029964 | 0.083021 | 0.063797 | 0.089822 | 0.116432 | 0.118120 | 0.118398 | -0.063785 | -0.061655 | -0.060413 | -0.065379 | -0.067090 | -0.063591 | -0.049709 | -0.068920 | -0.068244 | -0.066877 | -0.068085 | -0.064356 |
| rv_lanes_fueling_positions | -0.061743 | 0.160360 | 0.018920 | 0.133496 | -0.622701 | -0.190178 | -0.182524 | -0.581501 | 0.010716 | -0.151180 | -0.511173 | -0.255966 | -0.392845 | -0.457346 | -0.190962 | -0.398963 | -0.038217 | 1.000000 | 0.846987 | 0.846987 | 1.000000 | 0.496169 | 0.495422 | 0.399853 | 0.378001 | -0.183637 | -0.001112 | 0.313007 | 0.475144 | 0.465868 | 0.003051 | -0.113783 | -0.153564 | -0.150076 | -0.135411 | -0.140479 | -0.148726 | -0.152496 | 0.062098 | 0.061308 | 0.063067 | 0.060413 | 0.052227 | 0.051933 | 0.070434 | 0.061402 | 0.051048 | 0.061711 | 0.056074 | 0.060479 |
| hi_flow_lanes_fueling_positions | 0.002163 | 0.079428 | -0.034996 | 0.078831 | -0.627867 | -0.142028 | -0.255587 | -0.548081 | 0.053525 | -0.274691 | -0.538993 | -0.220977 | -0.433214 | -0.438330 | -0.095881 | -0.405198 | -0.017417 | 0.846987 | 1.000000 | 1.000000 | 0.846987 | 0.549648 | 0.629222 | 0.521613 | 0.533141 | -0.055337 | -0.000206 | 0.317313 | 0.508713 | 0.641990 | -0.036291 | -0.019848 | -0.024008 | -0.027927 | -0.013155 | 0.000052 | -0.003804 | -0.008115 | -0.023404 | -0.024019 | -0.021814 | -0.026227 | -0.029291 | -0.027887 | -0.012093 | -0.023134 | -0.031036 | -0.024134 | -0.027897 | -0.024467 |
| hi_flow_lanes_fueling_positions_2 | 0.002163 | 0.079428 | -0.034996 | 0.078831 | -0.627867 | -0.142028 | -0.255587 | -0.548081 | 0.053525 | -0.274691 | -0.538993 | -0.220977 | -0.433214 | -0.438330 | -0.095881 | -0.405198 | -0.017417 | 0.846987 | 1.000000 | 1.000000 | 0.846987 | 0.549648 | 0.629222 | 0.521613 | 0.533141 | -0.055337 | -0.000206 | 0.317313 | 0.508713 | 0.641990 | -0.036291 | -0.019848 | -0.024008 | -0.027927 | -0.013155 | 0.000052 | -0.003804 | -0.008115 | -0.023404 | -0.024019 | -0.021814 | -0.026227 | -0.029291 | -0.027887 | -0.012093 | -0.023134 | -0.031036 | -0.024134 | -0.027897 | -0.024467 |
| rv_lanes_fueling_positions_2 | -0.061743 | 0.160360 | 0.018920 | 0.133496 | -0.622701 | -0.190178 | -0.182524 | -0.581501 | 0.010716 | -0.151180 | -0.511173 | -0.255966 | -0.392845 | -0.457346 | -0.190962 | -0.398963 | -0.038217 | 1.000000 | 0.846987 | 0.846987 | 1.000000 | 0.496169 | 0.495422 | 0.399853 | 0.378001 | -0.183637 | -0.001112 | 0.313007 | 0.475144 | 0.465868 | 0.003051 | -0.113783 | -0.153564 | -0.150076 | -0.135411 | -0.140479 | -0.148726 | -0.152496 | 0.062098 | 0.061308 | 0.063067 | 0.060413 | 0.052227 | 0.051933 | 0.070434 | 0.061402 | 0.051048 | 0.061711 | 0.056074 | 0.060479 |
| mens_toilet_count | -0.224399 | 0.238842 | 0.181408 | 0.281180 | -0.325018 | -0.028772 | -0.297923 | -0.359655 | 0.059919 | -0.374747 | -0.323835 | -0.243266 | -0.333490 | -0.245009 | -0.153198 | -0.382187 | 0.047352 | 0.496169 | 0.549648 | 0.549648 | 0.496169 | 1.000000 | 0.845316 | 0.651207 | 0.575414 | -0.282865 | 0.000330 | 0.221174 | 0.271181 | 0.564103 | 0.135531 | 0.004687 | -0.020455 | -0.011776 | -0.018392 | -0.028916 | -0.028709 | -0.024698 | 0.022068 | 0.014460 | 0.023264 | 0.007533 | 0.004894 | 0.006492 | 0.018346 | 0.024479 | 0.010013 | 0.027694 | 0.014832 | 0.018788 |
| mens_urinal_count | -0.219693 | 0.253151 | 0.171980 | 0.338129 | -0.291902 | -0.035912 | -0.141283 | -0.252016 | 0.087382 | -0.243045 | -0.214876 | -0.181535 | -0.269849 | -0.171626 | -0.087910 | -0.296975 | 0.069673 | 0.495422 | 0.629222 | 0.629222 | 0.495422 | 0.845316 | 1.000000 | 0.713847 | 0.563347 | -0.225137 | 0.000427 | 0.181059 | 0.285151 | 0.618443 | 0.053281 | 0.017835 | -0.000086 | 0.009639 | -0.001233 | -0.011006 | -0.012266 | -0.007359 | -0.033451 | -0.041886 | -0.032590 | -0.047686 | -0.049409 | -0.048617 | -0.037480 | -0.029904 | -0.044481 | -0.027734 | -0.040755 | -0.036869 |
| womens_toilet_count | -0.326963 | -0.036418 | 0.276330 | 0.008764 | -0.251707 | -0.043336 | -0.258020 | -0.275491 | 0.018092 | -0.244836 | -0.234472 | -0.121102 | -0.364046 | -0.154260 | -0.037604 | -0.334350 | -0.001954 | 0.399853 | 0.521613 | 0.521613 | 0.399853 | 0.651207 | 0.713847 | 1.000000 | 0.645601 | -0.289790 | -0.000143 | 0.240171 | 0.290886 | 0.471961 | 0.030015 | 0.036361 | 0.005993 | 0.019120 | -0.004440 | -0.016323 | -0.016243 | -0.013106 | -0.020618 | -0.028287 | -0.023346 | -0.033131 | -0.032099 | -0.038713 | -0.033646 | -0.019770 | -0.028852 | -0.014284 | -0.024114 | -0.023736 |
| womens_sink_count | 0.123938 | -0.252981 | -0.144581 | 0.167756 | -0.584866 | -0.153413 | -0.322191 | -0.537279 | -0.010498 | -0.431842 | -0.561176 | -0.345355 | -0.385129 | -0.471007 | -0.190749 | -0.292064 | 0.142282 | 0.378001 | 0.533141 | 0.533141 | 0.378001 | 0.575414 | 0.563347 | 0.645601 | 1.000000 | 0.065244 | 0.000752 | 0.196257 | 0.331075 | 0.553721 | 0.154516 | 0.003834 | 0.036784 | 0.019101 | 0.040352 | 0.066638 | 0.067118 | 0.062515 | -0.056359 | -0.047700 | -0.048028 | -0.050728 | -0.053586 | -0.053713 | -0.037267 | -0.060489 | -0.059211 | -0.058401 | -0.055087 | -0.054656 |
| site_id_msba | 0.671855 | -0.203161 | -0.608462 | 0.059525 | -0.106862 | -0.217948 | 0.278455 | -0.094199 | -0.097665 | 0.131216 | -0.160090 | -0.300217 | 0.248457 | -0.228221 | -0.297454 | 0.244569 | -0.207402 | -0.183637 | -0.055337 | -0.055337 | -0.183637 | -0.282865 | -0.225137 | -0.289790 | 0.065244 | 1.000000 | 0.000772 | -0.103979 | -0.058388 | -0.166502 | -0.193344 | 0.063396 | 0.175865 | 0.133423 | 0.170910 | 0.243303 | 0.243848 | 0.233893 | -0.091511 | -0.070688 | -0.078922 | -0.065077 | -0.067801 | -0.061874 | -0.040810 | -0.098815 | -0.084240 | -0.102051 | -0.084101 | -0.084870 |
| calendar.fiscal_week_id_for_year | -0.000880 | 0.000121 | 0.000534 | 0.000298 | -0.000018 | -0.000348 | 0.000254 | 0.000509 | -0.000215 | -0.000192 | -0.000751 | -0.000641 | 0.000714 | -0.001028 | -0.000781 | 0.000376 | 0.001060 | -0.001112 | -0.000206 | -0.000206 | -0.001112 | 0.000330 | 0.000427 | -0.000143 | 0.000752 | 0.000772 | 1.000000 | 0.053392 | 0.053343 | 0.019301 | 0.027099 | -0.172594 | -0.152188 | -0.219238 | -0.139840 | -0.053504 | -0.097675 | -0.092032 | 0.296885 | 0.301069 | 0.292618 | 0.306534 | 0.305197 | 0.292941 | 0.284684 | 0.294105 | 0.309406 | 0.294038 | 0.307518 | 0.299637 |
| daily_yoy_ndt.total_inside_sales | -0.069297 | -0.196982 | 0.088361 | -0.227230 | -0.146207 | -0.092304 | -0.138241 | -0.121629 | -0.035234 | -0.141035 | -0.133918 | -0.173500 | -0.183157 | -0.099636 | -0.131280 | -0.062838 | -0.074664 | 0.313007 | 0.317313 | 0.317313 | 0.313007 | 0.221174 | 0.181059 | 0.240171 | 0.196257 | -0.103979 | 0.053392 | 1.000000 | 0.883514 | 0.403600 | 0.281238 | 0.058576 | 0.075577 | 0.073476 | 0.077074 | 0.039160 | 0.041226 | 0.041159 | 0.092610 | 0.086198 | 0.085694 | 0.088669 | 0.084326 | 0.087718 | 0.094755 | 0.094119 | 0.080625 | 0.093671 | 0.083031 | 0.089959 |
| daily_yoy_ndt.total_food_service | 0.026521 | -0.108011 | -0.018858 | -0.152398 | -0.364787 | -0.194294 | -0.224750 | -0.306799 | -0.092283 | -0.231706 | -0.338342 | -0.251672 | -0.319879 | -0.293003 | -0.198665 | -0.167597 | 0.048519 | 0.475144 | 0.508713 | 0.508713 | 0.475144 | 0.271181 | 0.285151 | 0.290886 | 0.331075 | -0.058388 | 0.053343 | 0.883514 | 1.000000 | 0.548764 | 0.181824 | 0.027738 | 0.018505 | 0.007466 | 0.024667 | 0.015636 | 0.009636 | 0.006103 | 0.048298 | 0.044379 | 0.044179 | 0.047726 | 0.043048 | 0.046726 | 0.056599 | 0.049347 | 0.038730 | 0.047182 | 0.039243 | 0.046367 |
| diesel_y | -0.045712 | 0.075498 | 0.050270 | 0.144365 | -0.412535 | -0.059426 | -0.182090 | -0.357292 | 0.107757 | -0.321937 | -0.373938 | -0.163803 | -0.265728 | -0.300413 | -0.105679 | -0.264862 | 0.256609 | 0.465868 | 0.641990 | 0.641990 | 0.465868 | 0.564103 | 0.618443 | 0.471961 | 0.553721 | -0.166502 | 0.019301 | 0.403600 | 0.548764 | 1.000000 | 0.134801 | 0.064079 | 0.069725 | 0.062459 | 0.072611 | 0.083125 | 0.082192 | 0.080170 | -0.011455 | -0.017186 | -0.013697 | -0.019769 | -0.020401 | -0.017175 | -0.009395 | -0.010282 | -0.019956 | -0.010181 | -0.018122 | -0.013885 |
| unleaded | -0.221243 | -0.069720 | 0.201761 | -0.073563 | -0.128380 | -0.057403 | -0.013171 | -0.116141 | -0.000437 | -0.015916 | -0.168466 | -0.208143 | 0.099381 | -0.203366 | -0.240116 | 0.004000 | 0.039938 | 0.003051 | -0.036291 | -0.036291 | 0.003051 | 0.135531 | 0.053281 | 0.030015 | 0.154516 | -0.193344 | 0.027099 | 0.281238 | 0.181824 | 0.134801 | 1.000000 | 0.135330 | 0.142817 | 0.144860 | 0.129606 | 0.107321 | 0.116986 | 0.123939 | 0.022283 | 0.016970 | 0.011608 | 0.016518 | 0.025357 | 0.026163 | 0.004941 | 0.025023 | 0.025600 | 0.019772 | 0.022922 | 0.021207 |
| Crude Oil Cushing, OK WTI Spot Price FOB (Dollars per Barrel) | -0.015587 | 0.004553 | -0.005757 | -0.057014 | 0.058864 | -0.006478 | 0.033294 | 0.083356 | -0.020377 | 0.013977 | 0.039332 | -0.005632 | 0.076541 | 0.028706 | -0.029025 | 0.059577 | 0.029964 | -0.113783 | -0.019848 | -0.019848 | -0.113783 | 0.004687 | 0.017835 | 0.036361 | 0.003834 | 0.063396 | -0.172594 | 0.058576 | 0.027738 | 0.064079 | 0.135330 | 1.000000 | 0.674169 | 0.673907 | 0.604378 | 0.611057 | 0.636820 | 0.638360 | -0.063385 | -0.073693 | -0.071707 | -0.078269 | -0.067202 | -0.057912 | -0.070400 | -0.057983 | -0.069206 | -0.063029 | -0.072620 | -0.066540 |
| New York Harbor Conventional Gasoline Regular Spot Price FOB (Dollars per Gallon) | 0.111766 | -0.022697 | -0.133397 | -0.074864 | 0.036827 | -0.042594 | 0.056214 | 0.078264 | -0.049467 | 0.015495 | 0.013689 | -0.038843 | 0.112430 | -0.001558 | -0.065856 | 0.110350 | 0.083021 | -0.153564 | -0.024008 | -0.024008 | -0.153564 | -0.020455 | -0.000086 | 0.005993 | 0.036784 | 0.175865 | -0.152188 | 0.075577 | 0.018505 | 0.069725 | 0.142817 | 0.674169 | 1.000000 | 0.978015 | 0.845447 | 0.860237 | 0.874810 | 0.883220 | -0.037491 | -0.053359 | -0.060280 | -0.047545 | -0.035668 | -0.016172 | -0.043990 | -0.027996 | -0.043118 | -0.044360 | -0.051456 | -0.041725 |
| U.S. Gulf Coast Conventional Gasoline Regular Spot Price FOB (Dollars per Gallon) | 0.055195 | -0.012559 | -0.078795 | -0.072173 | 0.056064 | -0.022153 | 0.044057 | 0.094329 | -0.035299 | 0.010410 | 0.030691 | -0.021156 | 0.100207 | 0.014971 | -0.049700 | 0.092064 | 0.063797 | -0.150076 | -0.027927 | -0.027927 | -0.150076 | -0.011776 | 0.009639 | 0.019120 | 0.019101 | 0.133423 | -0.219238 | 0.073476 | 0.007466 | 0.062459 | 0.144860 | 0.673907 | 0.978015 | 1.000000 | 0.862013 | 0.831065 | 0.860551 | 0.875596 | -0.055409 | -0.073366 | -0.080131 | -0.068751 | -0.056154 | -0.035038 | -0.065366 | -0.044320 | -0.062388 | -0.061432 | -0.070994 | -0.060356 |
| Los Angeles Reformulated RBOB Regular Gasoline Spot Price (Dollars per Gallon) | 0.116480 | -0.021243 | -0.141029 | -0.073249 | 0.016890 | -0.052771 | 0.054176 | 0.058005 | -0.050152 | 0.014994 | -0.000893 | -0.045799 | 0.100670 | -0.013134 | -0.068488 | 0.095448 | 0.089822 | -0.135411 | -0.013155 | -0.013155 | -0.135411 | -0.018392 | -0.001233 | -0.004440 | 0.040352 | 0.170910 | -0.139840 | 0.077074 | 0.024667 | 0.072611 | 0.129606 | 0.604378 | 0.845447 | 0.862013 | 1.000000 | 0.784389 | 0.811903 | 0.859860 | -0.033419 | -0.044149 | -0.046522 | -0.045314 | -0.034479 | -0.018812 | -0.031855 | -0.026727 | -0.040917 | -0.037608 | -0.045372 | -0.036598 |
| New York Harbor Ultra-Low Sulfur No 2 Diesel Spot Price (Dollars per Gallon) | 0.219266 | -0.043348 | -0.236430 | -0.091553 | -0.008804 | -0.085423 | 0.080157 | 0.036619 | -0.070121 | 0.024507 | -0.017328 | -0.074726 | 0.126263 | -0.026863 | -0.097558 | 0.132509 | 0.116432 | -0.140479 | 0.000052 | 0.000052 | -0.140479 | -0.028916 | -0.011006 | -0.016323 | 0.066638 | 0.243303 | -0.053504 | 0.039160 | 0.015636 | 0.083125 | 0.107321 | 0.611057 | 0.860237 | 0.831065 | 0.784389 | 1.000000 | 0.967929 | 0.943774 | -0.046209 | -0.055854 | -0.057853 | -0.056623 | -0.046295 | -0.031156 | -0.043890 | -0.041328 | -0.051286 | -0.051341 | -0.058691 | -0.049197 |
| U.S. Gulf Coast Ultra-Low Sulfur No 2 Diesel Spot Price (Dollars per Gallon) | 0.215758 | -0.041276 | -0.236312 | -0.092602 | -0.005068 | -0.083834 | 0.078923 | 0.041878 | -0.070493 | 0.023652 | -0.015423 | -0.072745 | 0.129549 | -0.025627 | -0.097000 | 0.133461 | 0.118120 | -0.148726 | -0.003804 | -0.003804 | -0.148726 | -0.028709 | -0.012266 | -0.016243 | 0.067118 | 0.243848 | -0.097675 | 0.041226 | 0.009636 | 0.082192 | 0.116986 | 0.636820 | 0.874810 | 0.860551 | 0.811903 | 0.967929 | 1.000000 | 0.971855 | -0.049164 | -0.058611 | -0.060057 | -0.062383 | -0.051056 | -0.034437 | -0.046849 | -0.044784 | -0.055425 | -0.053649 | -0.061708 | -0.052233 |
| Los Angeles, CA Ultra-Low Sulfur CARB Diesel Spot Price (Dollars per Gallon) | 0.201209 | -0.038229 | -0.224180 | -0.094888 | 0.000400 | -0.082242 | 0.074755 | 0.048122 | -0.071627 | 0.021122 | -0.010920 | -0.070289 | 0.127420 | -0.022897 | -0.096472 | 0.129742 | 0.118398 | -0.152496 | -0.008115 | -0.008115 | -0.152496 | -0.024698 | -0.007359 | -0.013106 | 0.062515 | 0.233893 | -0.092032 | 0.041159 | 0.006103 | 0.080170 | 0.123939 | 0.638360 | 0.883220 | 0.875596 | 0.859860 | 0.943774 | 0.971855 | 1.000000 | -0.037569 | -0.048356 | -0.050506 | -0.050479 | -0.039152 | -0.021859 | -0.038047 | -0.032214 | -0.042693 | -0.042552 | -0.049768 | -0.040794 |
| Utah | -0.055653 | 0.035205 | 0.056122 | 0.039701 | 0.007527 | 0.000305 | -0.033512 | -0.018028 | -0.010545 | -0.034467 | 0.015755 | 0.000544 | -0.022023 | 0.015653 | -0.016554 | -0.024250 | -0.063785 | 0.062098 | -0.023404 | -0.023404 | 0.062098 | 0.022068 | -0.033451 | -0.020618 | -0.056359 | -0.091511 | 0.296885 | 0.092610 | 0.048298 | -0.011455 | 0.022283 | -0.063385 | -0.037491 | -0.055409 | -0.033419 | -0.046209 | -0.049164 | -0.037569 | 1.000000 | 0.994015 | 0.983294 | 0.988611 | 0.989559 | 0.986871 | 0.982192 | 0.996262 | 0.994831 | 0.996812 | 0.996090 | 0.999536 |
| Idaho | -0.030094 | 0.024390 | 0.032672 | 0.045855 | -0.000648 | -0.005055 | -0.027646 | -0.026489 | -0.012762 | -0.031419 | 0.008787 | -0.006572 | -0.015966 | 0.007934 | -0.022529 | -0.013359 | -0.061655 | 0.061308 | -0.024019 | -0.024019 | 0.061308 | 0.014460 | -0.041886 | -0.028287 | -0.047700 | -0.070688 | 0.301069 | 0.086198 | 0.044379 | -0.017186 | 0.016970 | -0.073693 | -0.053359 | -0.073366 | -0.044149 | -0.055854 | -0.058611 | -0.048356 | 0.994015 | 1.000000 | 0.991481 | 0.987632 | 0.985086 | 0.974949 | 0.991994 | 0.983237 | 0.987816 | 0.993173 | 0.995981 | 0.996459 |
| Oregon | -0.047792 | 0.027531 | 0.045896 | 0.052789 | 0.005157 | -0.002466 | -0.023350 | -0.018848 | -0.012346 | -0.026358 | 0.013503 | -0.003293 | -0.012388 | 0.012450 | -0.019123 | -0.009183 | -0.060413 | 0.063067 | -0.021814 | -0.021814 | 0.063067 | 0.023264 | -0.032590 | -0.023346 | -0.048028 | -0.078922 | 0.292618 | 0.085694 | 0.044179 | -0.013697 | 0.011608 | -0.071707 | -0.060280 | -0.080131 | -0.046522 | -0.057853 | -0.060057 | -0.050506 | 0.983294 | 0.991481 | 1.000000 | 0.968751 | 0.963542 | 0.948551 | 0.992555 | 0.968598 | 0.970069 | 0.991582 | 0.985726 | 0.985832 |
| South Dakota | -0.020774 | 0.018392 | 0.024845 | 0.042652 | -0.003295 | -0.003297 | -0.033782 | -0.028190 | -0.011296 | -0.034813 | 0.006831 | -0.004501 | -0.022670 | 0.005452 | -0.020469 | -0.020458 | -0.065379 | 0.060413 | -0.026227 | -0.026227 | 0.060413 | 0.007533 | -0.047686 | -0.033131 | -0.050728 | -0.065077 | 0.306534 | 0.088669 | 0.047726 | -0.019769 | 0.016518 | -0.078269 | -0.047545 | -0.068751 | -0.045314 | -0.056623 | -0.062383 | -0.050479 | 0.988611 | 0.987632 | 0.968751 | 1.000000 | 0.997123 | 0.985956 | 0.975940 | 0.985092 | 0.992629 | 0.979743 | 0.991718 | 0.990779 |
| Nebraska | -0.030016 | 0.025206 | 0.033402 | 0.036869 | -0.000368 | -0.001969 | -0.034434 | -0.025024 | -0.010555 | -0.036313 | 0.008799 | -0.001276 | -0.021057 | 0.008033 | -0.018873 | -0.022092 | -0.067090 | 0.052227 | -0.029291 | -0.029291 | 0.052227 | 0.004894 | -0.049409 | -0.032099 | -0.053586 | -0.067801 | 0.305197 | 0.084326 | 0.043048 | -0.020401 | 0.025357 | -0.067202 | -0.035668 | -0.056154 | -0.034479 | -0.046295 | -0.051056 | -0.039152 | 0.989559 | 0.985086 | 0.963542 | 0.997123 | 1.000000 | 0.991753 | 0.969631 | 0.988620 | 0.996409 | 0.978700 | 0.991732 | 0.991093 |
| New Mexico | -0.021913 | 0.034425 | 0.025374 | 0.036282 | -0.002960 | -0.008584 | -0.030729 | -0.026624 | -0.016387 | -0.037008 | 0.005404 | -0.007784 | -0.017104 | 0.005040 | -0.024887 | -0.019137 | -0.063591 | 0.051933 | -0.027887 | -0.027887 | 0.051933 | 0.006492 | -0.048617 | -0.038713 | -0.053713 | -0.061874 | 0.292941 | 0.087718 | 0.046726 | -0.017175 | 0.026163 | -0.057912 | -0.016172 | -0.035038 | -0.018812 | -0.031156 | -0.034437 | -0.021859 | 0.986871 | 0.974949 | 0.948551 | 0.985956 | 0.991753 | 1.000000 | 0.956490 | 0.992738 | 0.993258 | 0.973159 | 0.982690 | 0.986419 |
| Washington | 0.008629 | 0.016107 | -0.007749 | 0.051183 | -0.014709 | -0.018011 | -0.019198 | -0.038876 | -0.021654 | -0.027705 | -0.003854 | -0.019537 | -0.010304 | -0.005917 | -0.033577 | -0.000141 | -0.049709 | 0.070434 | -0.012093 | -0.012093 | 0.070434 | 0.018346 | -0.037480 | -0.033646 | -0.037267 | -0.040810 | 0.284684 | 0.094755 | 0.056599 | -0.009395 | 0.004941 | -0.070400 | -0.043990 | -0.065366 | -0.031855 | -0.043890 | -0.046849 | -0.038047 | 0.982192 | 0.991994 | 0.992555 | 0.975940 | 0.969631 | 0.956490 | 1.000000 | 0.968765 | 0.969322 | 0.985508 | 0.982607 | 0.985289 |
| Arizona | -0.067927 | 0.043985 | 0.066775 | 0.039142 | 0.010503 | 0.002267 | -0.036744 | -0.014272 | -0.009949 | -0.037357 | 0.016897 | 0.003551 | -0.025185 | 0.017064 | -0.013186 | -0.031042 | -0.068920 | 0.061402 | -0.023134 | -0.023134 | 0.061402 | 0.024479 | -0.029904 | -0.019770 | -0.060489 | -0.098815 | 0.294105 | 0.094119 | 0.049347 | -0.010282 | 0.025023 | -0.057983 | -0.027996 | -0.044320 | -0.026727 | -0.041328 | -0.044784 | -0.032214 | 0.996262 | 0.983237 | 0.968598 | 0.985092 | 0.988620 | 0.992738 | 0.968765 | 1.000000 | 0.994290 | 0.990457 | 0.989016 | 0.994583 |
| Colorado | -0.051333 | 0.033964 | 0.053654 | 0.037052 | 0.007582 | 0.003971 | -0.036112 | -0.017240 | -0.007521 | -0.036772 | 0.015937 | 0.005659 | -0.022489 | 0.015889 | -0.012388 | -0.026716 | -0.068244 | 0.051048 | -0.031036 | -0.031036 | 0.051048 | 0.010013 | -0.044481 | -0.028852 | -0.059211 | -0.084240 | 0.309406 | 0.080625 | 0.038730 | -0.019956 | 0.025600 | -0.069206 | -0.043118 | -0.062388 | -0.040917 | -0.051286 | -0.055425 | -0.042693 | 0.994831 | 0.987816 | 0.970069 | 0.992629 | 0.996409 | 0.993258 | 0.969322 | 0.994290 | 1.000000 | 0.986566 | 0.995633 | 0.995398 |
| Nevada | -0.073312 | 0.037200 | 0.071321 | 0.043851 | 0.016895 | 0.005010 | -0.029606 | -0.008257 | -0.008001 | -0.028832 | 0.023099 | 0.004964 | -0.018437 | 0.022840 | -0.012219 | -0.021298 | -0.066877 | 0.061711 | -0.024134 | -0.024134 | 0.061711 | 0.027694 | -0.027734 | -0.014284 | -0.058401 | -0.102051 | 0.294038 | 0.093671 | 0.047182 | -0.010181 | 0.019772 | -0.063029 | -0.044360 | -0.061432 | -0.037608 | -0.051341 | -0.053649 | -0.042552 | 0.996812 | 0.993173 | 0.991582 | 0.979743 | 0.978700 | 0.973159 | 0.985508 | 0.990457 | 0.986566 | 1.000000 | 0.992810 | 0.996515 |
| Wyoming | -0.052838 | 0.029308 | 0.055134 | 0.042974 | 0.007053 | 0.004445 | -0.033132 | -0.018491 | -0.006230 | -0.034716 | 0.015149 | 0.003100 | -0.019923 | 0.014547 | -0.014626 | -0.022066 | -0.068085 | 0.056074 | -0.027897 | -0.027897 | 0.056074 | 0.014832 | -0.040755 | -0.024114 | -0.055087 | -0.084101 | 0.307518 | 0.083031 | 0.039243 | -0.018122 | 0.022922 | -0.072620 | -0.051456 | -0.070994 | -0.045372 | -0.058691 | -0.061708 | -0.049768 | 0.996090 | 0.995981 | 0.985726 | 0.991718 | 0.991732 | 0.982690 | 0.982607 | 0.989016 | 0.995633 | 0.992810 | 1.000000 | 0.997811 |
| W. Monthly Average Temp | -0.048845 | 0.032454 | 0.049899 | 0.041602 | 0.005740 | -0.000492 | -0.031946 | -0.019699 | -0.010729 | -0.033716 | 0.014076 | -0.000620 | -0.020090 | 0.013711 | -0.017574 | -0.021285 | -0.064356 | 0.060479 | -0.024467 | -0.024467 | 0.060479 | 0.018788 | -0.036869 | -0.023736 | -0.054656 | -0.084870 | 0.299637 | 0.089959 | 0.046367 | -0.013885 | 0.021207 | -0.066540 | -0.041725 | -0.060356 | -0.036598 | -0.049197 | -0.052233 | -0.040794 | 0.999536 | 0.996459 | 0.985832 | 0.990779 | 0.991093 | 0.986419 | 0.985289 | 0.994583 | 0.995398 | 0.996515 | 0.997811 | 1.000000 |
Now we need to convert categorical variables to numeric values for future analysis.
df_cat = data.select_dtypes(include = 'object')
df_cat.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 13542 entries, 0 to 13541
Data columns (total 25 columns):
 #   Column                              Non-Null Count  Dtype
---  ------                              --------------  -----
 0   lottery                             13542 non-null  object
 1   freal                               13542 non-null  object
 2   bonfire_grill                       13542 non-null  object
 3   pizza                               13542 non-null  object
 4   cinnabon                            13542 non-null  object
 5   ethanol_free                        13542 non-null  object
 6   hi_flow_lanes                       13542 non-null  object
 7   rv_lanes                            13542 non-null  object
 8   hi_flow_rv_lanes                    13542 non-null  object
 9   def                                 13542 non-null  object
 10  cat_scales                          13542 non-null  object
 11  rv_dumps                            13542 non-null  object
 12  propane                             13542 non-null  object
 13  traditional_forecourt_layout        13542 non-null  object
 14  traditional_forecourt_stack_type    13542 non-null  object
 15  rv_lanes_layout                     13542 non-null  object
 16  rv_lanes_stack_type                 13542 non-null  object
 17  hi_flow_lanes_layout                13542 non-null  object
 18  hi_flow_lanes_stack_type            13542 non-null  object
 19  hi_flow_rv_lanes_layout             13542 non-null  object
 20  hi_flow_rv_lanes_stack_type         13542 non-null  object
 21  capital_projects.soft_opening_date  13542 non-null  object
 22  calendar.day_of_week                13542 non-null  object
 23  calendar_information.type_of_day    13542 non-null  object
 24  Holidays                            13542 non-null  object
dtypes: object(25)
memory usage: 2.7+ MB
col_ignore = ['capital_projects.soft_opening_date', 'calendar.calendar_day_date']
df_ignore = data[col_ignore]
data = data.drop(col_ignore, axis = 1)
data = pd.get_dummies(data, drop_first = True, dtype = float)
data['soft_opening_date'] = df_ignore['capital_projects.soft_opening_date']
data['calendar_day_date'] = df_ignore['calendar.calendar_day_date']
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 13542 entries, 0 to 13541
Data columns (total 100 columns):
 #   Column                                                                              Non-Null Count  Dtype
---  ------                                                                              --------------  -----
 0   open_year                                                                           13542 non-null  int64
 1   square_feet                                                                         13542 non-null  int64
 2   years_since_last_project                                                            13542 non-null  int64
 3   parking_spaces                                                                      13542 non-null  int64
 4   x1_mile_pop                                                                         13542 non-null  int64
 5   x1_mile_emp                                                                         13542 non-null  int64
 6   x1_mile_income                                                                      13542 non-null  int64
 7   x1_2_mile_pop                                                                       13542 non-null  int64
 8   x1_2_mile_emp                                                                       13542 non-null  int64
 9   x1_2_mile_income                                                                    13542 non-null  int64
 10  x5_min_pop                                                                          13542 non-null  int64
 11  x5_min_emp                                                                          13542 non-null  int64
 12  x5_min_inc                                                                          13542 non-null  int64
 13  x7_min_pop                                                                          13542 non-null  int64
 14  x7_min_emp                                                                          13542 non-null  int64
 15  x7_min_inc                                                                          13542 non-null  int64
 16  traditional_forecourt_fueling_positions                                             13542 non-null  int64
 17  rv_lanes_fueling_positions                                                          13542 non-null  int64
 18  hi_flow_lanes_fueling_positions                                                     13542 non-null  int64
 19  hi_flow_lanes_fueling_positions_2                                                   13542 non-null  int64
 20  rv_lanes_fueling_positions_2                                                        13542 non-null  int64
 21  mens_toilet_count                                                                   13542 non-null  int64
 22  mens_urinal_count                                                                   13542 non-null  int64
 23  womens_toilet_count                                                                 13542 non-null  int64
 24  womens_sink_count                                                                   13542 non-null  int64
 25  site_id_msba                                                                        13542 non-null  int64
 26  calendar.fiscal_week_id_for_year                                                    13542 non-null  int64
 27  daily_yoy_ndt.total_inside_sales                                                    13542 non-null  float64
 28  daily_yoy_ndt.total_food_service                                                    13542 non-null  float64
 29  diesel_y                                                                            13542 non-null  float64
 30  unleaded                                                                            13542 non-null  float64
 31  Crude Oil Cushing, OK WTI Spot Price FOB (Dollars per Barrel)                       13542 non-null  float64
 32  New York Harbor Conventional Gasoline Regular Spot Price FOB (Dollars per Gallon)   13542 non-null  float64
 33  U.S. Gulf Coast Conventional Gasoline Regular Spot Price FOB (Dollars per Gallon)   13542 non-null  float64
 34  Los Angeles Reformulated RBOB Regular Gasoline Spot Price (Dollars per Gallon)      13542 non-null  float64
 35  New York Harbor Ultra-Low Sulfur No 2 Diesel Spot Price (Dollars per Gallon)        13542 non-null  float64
 36  U.S. Gulf Coast Ultra-Low Sulfur No 2 Diesel Spot Price (Dollars per Gallon)        13542 non-null  float64
 37  Los Angeles, CA Ultra-Low Sulfur CARB Diesel Spot Price (Dollars per Gallon)        13542 non-null  float64
 38  Utah                                                                                13542 non-null  float64
 39  Idaho                                                                               13542 non-null  float64
 40  Oregon                                                                              13542 non-null  float64
 41  South Dakota                                                                        13542 non-null  float64
 42  Nebraska                                                                            13542 non-null  float64
 43  New Mexico                                                                          13542 non-null  float64
 44  Washington                                                                          13542 non-null  float64
 45  Arizona                                                                             13542 non-null  float64
 46  Colorado                                                                            13542 non-null  float64
 47  Nevada                                                                              13542 non-null  float64
 48  Wyoming                                                                             13542 non-null  float64
 49  W. Monthly Average Temp                                                             13542 non-null  float64
 50  lottery_Yes                                                                         13542 non-null  float64
 51  freal_Yes                                                                           13542 non-null  float64
 52  bonfire_grill_Yes                                                                   13542 non-null  float64
 53  pizza_Yes                                                                           13542 non-null  float64
 54  cinnabon_Yes                                                                        13542 non-null  float64
 55  ethanol_free_Yes                                                                    13542 non-null  float64
 56  hi_flow_lanes_Yes                                                                   13542 non-null  float64
 57  rv_lanes_Yes                                                                        13542 non-null  float64
 58  hi_flow_rv_lanes_Yes                                                                13542 non-null  float64
 59  def_Yes                                                                             13542 non-null  float64
 60  cat_scales_Yes                                                                      13542 non-null  float64
 61  rv_dumps_Yes                                                                        13542 non-null  float64
 62  propane_Yes                                                                         13542 non-null  float64
 63  traditional_forecourt_layout_Stack                                                  13542 non-null  float64
 64  traditional_forecourt_stack_type_Large                                              13542 non-null  float64
 65  traditional_forecourt_stack_type_None                                               13542 non-null  float64
 66  rv_lanes_layout_None                                                                13542 non-null  float64
 67  rv_lanes_layout_Stack                                                               13542 non-null  float64
 68  rv_lanes_stack_type_None                                                            13542 non-null  float64
 69  hi_flow_lanes_layout_None                                                           13542 non-null  float64
 70  hi_flow_lanes_layout_Stack                                                          13542 non-null  float64
 71  hi_flow_lanes_stack_type_None                                                       13542 non-null  float64
 72  hi_flow_rv_lanes_layout_In-Line                                                     13542 non-null  float64
 73  hi_flow_rv_lanes_layout_None                                                        13542 non-null  float64
 74  hi_flow_rv_lanes_layout_Stack                                                       13542 non-null  float64
 75  hi_flow_rv_lanes_stack_type_None                                                    13542 non-null  float64
 76  calendar.day_of_week_Monday                                                         13542 non-null  float64
 77  calendar.day_of_week_Saturday                                                       13542 non-null  float64
 78  calendar.day_of_week_Sunday                                                         13542 non-null  float64
 79  calendar.day_of_week_Thursday                                                       13542 non-null  float64
 80  calendar.day_of_week_Tuesday                                                        13542 non-null  float64
 81  calendar.day_of_week_Wednesday                                                      13542 non-null  float64
 82  calendar_information.type_of_day_WEEKEND                                            13542 non-null  float64
 83  Holidays_Christmas Day (Observed)                                                   13542 non-null  float64
 84  Holidays_Columbus Day                                                               13542 non-null  float64
 85  Holidays_Independence Day                                                           13542 non-null  float64
 86  Holidays_Independence Day (Observed)                                                13542 non-null  float64
 87  Holidays_Juneteenth National Independence Day                                       13542 non-null  float64
 88  Holidays_Juneteenth National Independence Day (Observed)                            13542 non-null  float64
 89  Holidays_Labor Day                                                                  13542 non-null  float64
 90  Holidays_Martin Luther King Jr. Day                                                 13542 non-null  float64
 91  Holidays_Memorial Day                                                               13542 non-null  float64
 92  Holidays_New Year's Day                                                             13542 non-null  float64
 93  Holidays_New Year's Day (Observed)                                                  13542 non-null  float64
 94  Holidays_Not Holiday                                                                13542 non-null  float64
 95  Holidays_Thanksgiving                                                               13542 non-null  float64
 96  Holidays_Veterans Day                                                               13542 non-null  float64
 97  Holidays_Washington's Birthday                                                      13542 non-null  float64
 98  soft_opening_date                                                                   13542 non-null  object
 99  calendar_day_date                                                                   13542 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(71), int64(27), object(1)
memory usage: 10.4+ MB
data.head()
| open_year | square_feet | years_since_last_project | parking_spaces | x1_mile_pop | x1_mile_emp | x1_mile_income | x1_2_mile_pop | x1_2_mile_emp | x1_2_mile_income | ... | Holidays_Martin Luther King Jr. Day | Holidays_Memorial Day | Holidays_New Year's Day | Holidays_New Year's Day (Observed) | Holidays_Not Holiday | Holidays_Thanksgiving | Holidays_Veterans Day | Holidays_Washington's Birthday | soft_opening_date | calendar_day_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021 | 5046 | 2 | 38 | 4046 | 3648 | 43435 | 556 | 642 | 45678 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1/12/2021 | 2021-01-17 |
| 1 | 2021 | 5046 | 2 | 38 | 4046 | 3648 | 43435 | 556 | 642 | 45678 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1/12/2021 | 2021-01-20 |
| 2 | 2021 | 5046 | 2 | 38 | 4046 | 3648 | 43435 | 556 | 642 | 45678 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1/12/2021 | 2021-01-21 |
| 3 | 2021 | 5046 | 2 | 38 | 4046 | 3648 | 43435 | 556 | 642 | 45678 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1/12/2021 | 2021-01-24 |
| 4 | 2021 | 5046 | 2 | 38 | 4046 | 3648 | 43435 | 556 | 642 | 45678 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1/12/2021 | 2021-01-29 |
5 rows × 100 columns
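The one-hot encoding step above (`pd.get_dummies` with `drop_first=True`) can be illustrated on a toy frame; the single `lottery` column here is just a stand-in for the Yes/No flags in our data.

```python
import pandas as pd

# Toy frame with one Yes/No categorical column, mirroring flags like 'lottery'
toy = pd.DataFrame({'lottery': ['Yes', 'No', 'Yes']})

# drop_first=True keeps only one indicator column ('lottery_Yes'),
# dropping the redundant 'lottery_No' to avoid the dummy-variable trap
encoded = pd.get_dummies(toy, drop_first=True, dtype=float)
print(encoded.columns.tolist())   # ['lottery_Yes']
print(encoded['lottery_Yes'].tolist())  # [1.0, 0.0, 1.0]
```

Dropping the first level is what produces columns like `lottery_Yes` (and no `lottery_No`) in the 100-column frame above.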
# Importing the necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Creating the dataframes (raw string literals avoid escaping the backslashes in the Windows paths);
# the time series frame is named time_data to match later cells and avoid shadowing the time module
time_data = pd.read_csv(r'C:\Users\jusha\Documents\IS 6813 MSBA Final Capstone Maverik Project Folder\time_series_data_msba.csv')
qual = pd.read_csv(r'C:\Users\jusha\Documents\IS 6813 MSBA Final Capstone Maverik Project Folder\qualitative_data_msba.csv')
# List of categorical columns you want to examine
categorical_columns = [
'lottery', 'freal', 'bonfire_grill', 'pizza', 'cinnabon', 'godfather_s_pizza',
'ethanol_free', 'diesel', 'hi_flow_lanes', 'rv_lanes', 'hi_flow_rv_lanes', 'def',
'cat_scales', 'car_wash', 'ev_charging', 'rv_dumps', 'propane',
'traditional_forecourt_layout', 'traditional_forecourt_stack_type',
'rv_lanes_layout', 'rv_lanes_stack_type',
'hi_flow_lanes_layout', 'hi_flow_lanes_stack_type',
'hi_flow_rv_lanes_layout', 'hi_flow_rv_lanes_stack_type',
'non_24_hour', 'self_check_out'
]
# Create a dictionary to store the counts for each feature
feature_counts = {}
# Count the occurrences of categories that are not "No" or "NA" for each column
for col in categorical_columns:
    filtered_data = qual[~qual[col].isin(['No', 'N/A', 'None'])]  # filter out "No", "N/A", and "None"
    category_counts = filtered_data[col].count()
    feature_counts[col] = category_counts
# Create a DataFrame to store the counts
counts_df = pd.DataFrame(feature_counts, index=['Count'])
# Sort the counts ascending so the largest bars appear at the top of the horizontal bar chart
sorted_counts = counts_df.T.sort_values(by='Count', ascending=True)
# Create a horizontal bar graph to visualize the most offered features
# (set figsize here rather than via plt.figure so pandas does not open a second, empty figure)
sorted_counts.plot(kind='barh', legend=False, figsize=(10, 6))
plt.title('Categorical Feature Counts (Excluding "No", "N/A", and "None") Across All Locations')
plt.xlabel('Total Count')
plt.ylabel('Categories')
plt.tight_layout()
# Show the plot
plt.show()
In this visual we count only the values that are not "No", "N/A", or "None". Treating those responses as zeros rather than counting them keeps them from distorting the descending order of the visualization.
Knowing which products and services each store offers helps us see how they can play a role in our model. It is interesting that F'real is offered widely among stores. Many of the products and services Maverik offers relate to large-vehicle customers, such as diesel and RV-related amenities. When we develop our model for future stores, it will be important for the model to connect predicted gallons of diesel sold to whether the store offers RV- or diesel-vehicle-related services.
# List of numeric columns you want to analyze, including 'open_year'
numeric_columns = [
'square_feet', 'front_door_count', 'years_since_last_project', 'parking_spaces',
'x1_mile_pop', 'x1_mile_emp', 'x1_mile_income',
'x1_2_mile_pop', 'x1_2_mile_emp', 'x1_2_mile_income',
'x5_min_pop', 'x5_min_emp', 'x5_min_inc',
'x7_min_pop', 'x7_min_emp', 'x7_min_inc',
'traditional_forecourt_fueling_positions',
'rv_lanes_fueling_positions',
'hi_flow_lanes_fueling_positions',
'hi_flow_lanes_fueling_positions_2',
'rv_lanes_fueling_positions_2',
'mens_toilet_count', 'mens_urinal_count', 'womens_toilet_count', 'womens_sink_count',
'open_year' # Include 'open_year' in the numeric analysis
]
# Basic summary statistics for numeric columns
summary_stats = qual[numeric_columns].describe()
# Histograms for numeric columns
plt.figure(figsize=(16, 12))
for i, col in enumerate(numeric_columns, 1):
    plt.subplot(5, 6, i)  # adjust the number of rows and columns as needed
    qual[col].hist()
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
# Box plots for numeric columns (to visualize distribution and outliers)
plt.figure(figsize=(16, 6))
qual[numeric_columns].boxplot()
plt.title('Box Plots of Numeric Columns')
plt.ylabel('Value')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
From these visualizations we can learn a couple of patterns about Maverik stores.
The weights designate the importance of each feature to a potential customer. These weights are only estimates of what a consumer may value; the exact values could differ.
There are 26 numeric features and 27 categorical features.
All feature weights are intended to sum to 1.
# Define weights for each variable using the variable_weights dictionary
variable_weights = {
'years_since_last_project': 0.075,
'ev_charging': 0.075,
'hi_flow_lanes_layout': 0.075,
'womens_toilet_count': 0.075,
'rv_lanes': 0.08333333333,
'x1_2_mile_income': 0.08333333333,
'hi_flow_rv_lanes_layout': 0.08333333333,
'bonfire_grill': 0.02857142857,
'godfather_s_pizza': 0.02857142857,
'x1_mile_income': 0.02857142857,
'x5_min_inc': 0.02857142857,
'rv_lanes_fueling_positions': 0.02857142857,
'non_24_hour': 0.02857142857,
'open_year': 0.02857142857,
'parking_spaces': 0.01071428571,
'cinnabon': 0.01071428571,
'hi_flow_lanes': 0.01071428571,
'def': 0.01071428571,
'propane': 0.01071428571,
'x1_mile_pop': 0.01071428571,
'x5_min_pop': 0.01071428571,
'x7_min_inc': 0.01071428571,
'traditional_forecourt_fueling_positions': 0.01071428571,
'rv_lanes_layout': 0.01071428571,
'hi_flow_lanes_fueling_positions_2': 0.01071428571,
'self_check_out': 0.01071428571,
'womens_sink_count': 0.01071428571,
'square_feet': 0.003846153846,
'front_door_count': 0.003846153846,
'lottery': 0.003846153846,
'freal': 0.003846153846,
'pizza': 0.003846153846,
'ethanol_free': 0.003846153846,
'diesel': 0.003846153846,
'hi_flow_rv_lanes': 0.003846153846,
'cat_scales': 0.003846153846,
'car_wash': 0.003846153846,
'rv_dumps': 0.003846153846,
'x1_mile_emp': 0.003846153846,
'x1_2_mile_pop': 0.003846153846,
'x1_2_mile_emp': 0.003846153846,
'x5_min_emp': 0.003846153846,
'x7_min_pop': 0.003846153846,
'x7_min_emp': 0.003846153846,
'traditional_forecourt_layout': 0.003846153846,
'traditional_forecourt_stack_type': 0.003846153846,
'rv_lanes_stack_type': 0.003846153846,
'hi_flow_lanes_fueling_positions': 0.003846153846,
'hi_flow_lanes_stack_type': 0.003846153846,
'rv_lanes_fueling_positions_2': 0.003846153846,
'hi_flow_rv_lanes_stack_type': 0.003846153846,
'mens_toilet_count': 0.003846153846,
'mens_urinal_count': 0.003846153846,
}
# Replace "N/A" and "None" with NaN in the DataFrame
qual = qual.replace(["N/A", "None"], np.nan)
# Create a list of numeric columns to be scaled
numeric_columns_to_scale = [col for col in variable_weights.keys() if col in qual.columns and qual[col].dtype in ['int64', 'float64']]
# Normalize numeric variables using Min-Max scaling for selected columns, while ignoring NaN values
for col in numeric_columns_to_scale:
    qual[col] = (qual[col] - qual[col].min()) / (qual[col].max() - qual[col].min())
# Calculate composite scores for each unique ID, ignoring NaN values in individual columns
qual['composite_score'] = qual[numeric_columns_to_scale].sum(axis=1, skipna=True)
# Rank unique IDs based on composite scores
ranked_df = qual.sort_values(by='composite_score', ascending=False)
# Display the ranked IDs along with the site_id_msba and composite_score
print(ranked_df[['site_id_msba', 'composite_score']])
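The min-max scaling applied above maps each numeric column onto [0, 1] before the columns are summed into the composite score. A small sketch of the same transformation on toy values (not the Maverik data) makes the behavior concrete:

```python
import pandas as pd

# Toy column standing in for any of the numeric features scaled above
toy = pd.DataFrame({'parking_spaces': [10, 20, 40]})

# Min-max scaling: the minimum maps to 0, the maximum to 1,
# and everything else falls proportionally in between
col = 'parking_spaces'
toy[col] = (toy[col] - toy[col].min()) / (toy[col].max() - toy[col].min())
print(toy[col].tolist())  # 0.0, ~0.333, 1.0
```

Because every scaled column shares the same [0, 1] range, summing them into a composite score does not let a large-magnitude feature (like income) swamp a small one (like toilet counts); the weights alone control each feature's influence.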
# Create the bar chart
plt.figure(figsize=(10, 8)) # Adjust the figure size as needed
plt.barh(range(len(ranked_df)), ranked_df['composite_score'])
plt.yticks(range(len(ranked_df)), ranked_df['site_id_msba']) # Set Site IDs as labels
plt.xlabel('Composite Score')
plt.ylabel('Site ID')
plt.title('Ranked Sites by Composite Score (Descending)')
plt.gca().invert_yaxis() # Invert the y-axis to show higher scores at the top
plt.tight_layout()
# Display the bar chart
plt.show()
    site_id_msba  composite_score
13         22715        13.924838
14         22750        12.492150
11         22645        12.091841
6          22330        12.074806
1          21980        11.895294
12         22680        11.496536
21         23240        11.148606
9          22540        11.131945
32         23905        11.089084
23         23380        10.577977
29         23730        10.482678
18         22890        10.329440
8          22505        10.273732
31         23835        10.125066
10         22575        10.067175
0          21560        10.024739
27         23555         9.657139
19         22925         9.654990
15         22785         9.597957
28         23660         9.256085
25         23450         9.241231
2          22015         9.170905
7          22400         9.075692
16         22820         8.950332
36         24535         8.932631
22         23345         8.860485
35         24255         8.794727
5          22260         8.287082
30         23765         8.285760
17         22855         8.224871
3          22085         7.775957
34         24220         7.765648
24         23415         7.341054
20         23135         7.317135
4          22120         7.093811
33         24150         6.384718
26         23485         5.575620
This graph tells an interesting story. The tier values used in the composite score were developed from a ranking survey. The "None" and "N/A" values were replaced so that a numeric value could be computed while still retaining the information those listings carried. Each tier listing was then expressed as a percentile based on its ranking relative to the other features. The results of this visual could change depending on the weight given to each preference, since the rankings were based on averages from the survey.
We see that the customer-preferred store in this scenario would be store 22715. When modeling for future stores, it may be important to track how this store's time series ranks for each of our 4 products over time. This is a store we may pay close attention to during training and testing of the model, knowing that on average it could be ranked as the most desirable store for a customer to visit.
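Since the rankings depend directly on the weights, it is worth verifying that whatever weight dictionary is used actually totals 1, and renormalizing it if hand-entered repeating decimals have introduced drift. A minimal sketch with hypothetical weights (not the full `variable_weights` mapping):

```python
# Hypothetical three-feature weight dictionary used only for illustration;
# the same check applies to the full variable_weights mapping above
weights = {'rv_lanes': 0.5, 'x1_2_mile_income': 0.3, 'parking_spaces': 0.2}

# Check the weights total (approximately) 1
total = sum(weights.values())
assert abs(total - 1.0) < 1e-9

# If the total drifts from 1, renormalizing keeps composite scores comparable
normalized = {k: v / total for k, v in weights.items()}
assert abs(sum(normalized.values()) - 1.0) < 1e-9
```

Renormalizing rather than re-entering the weights preserves their relative importance while restoring the sum-to-1 property the composite score assumes.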
The first variables we want to examine are total inside sales combined with total food service sales in 2021 for four different stores that opened in 2021.
# Total inside and food service sales for 2021
# convert 'calendar.calendar_day_date' to a date format
time_data['calendar.calendar_day_date'] = pd.to_datetime(time_data['calendar.calendar_day_date'], errors='coerce')
# filter data for 2021
df_2021 = time_data[time_data['calendar.calendar_day_date'].dt.year == 2021]
# store ids
store_ids_to_include = [21560, 22085, 22330, 22715]
# filter data for those store ids
df_filtered = df_2021[df_2021['site_id_msba'].isin(store_ids_to_include)]
# group by 'site_id_msba' and month, then calculate the sum
grouped = df_filtered.groupby(['site_id_msba', df_filtered['calendar.calendar_day_date'].dt.to_period('M')]) \
[['daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service']].sum().reset_index()
#convert the period to a string for the month
grouped['calendar.calendar_day_date'] = grouped['calendar.calendar_day_date'].dt.strftime('%Y-%m')
#create a new column for 'total inside/food service sales'
grouped['total_inside_food_service_sales'] = grouped['daily_yoy_ndt.total_inside_sales'] + grouped['daily_yoy_ndt.total_food_service']
#plot the info
plt.figure(figsize=(12, 6))
for store_id in store_ids_to_include:
    store_data = grouped[grouped['site_id_msba'] == store_id]
    plt.plot(store_data['calendar.calendar_day_date'], store_data['total_inside_food_service_sales'], label=f'Store {store_id} Inside/Food Service Sales')
plt.xlabel('Date (Month)')
plt.ylabel('Sales')
plt.title('2021 Total Inside/Food Service Sales by Month')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
One of the biggest challenges we will have to deal with in building a time series model is that Maverik stores experience seasonality in their sales. You can see that there is a heavy ramp period in the first month after a store opens; sales then peak in August with a slight decline afterward.
The next variables we want to examine are combined diesel and unleaded gas sales over time. Similar to the previous visual, there is a steep ramp in sales during the first month after opening; sales peak in July and then decline through the winter months.
#Total gasoline sales for 2021
#convert 'calendar.calendar_day_date' to a date format
time_data['calendar.calendar_day_date'] = pd.to_datetime(time_data['calendar.calendar_day_date'])
#filter data for the year 2021
df_2021 = time_data[time_data['calendar.calendar_day_date'].dt.year == 2021]
#store ids
store_ids_to_include = [21560, 22085, 22330, 22715]
#filter data for those store ids
df_filtered = df_2021[df_2021['site_id_msba'].isin(store_ids_to_include)]
#group by 'site_id_msba', 'calendar.calendar_day_date' (month) and calculate the sum
grouped = df_filtered.groupby(['site_id_msba', df_filtered['calendar.calendar_day_date'].dt.to_period('M')]) \
[['diesel', 'unleaded']].sum().reset_index()
#convert the period to a string for the month
grouped['calendar.calendar_day_date'] = grouped['calendar.calendar_day_date'].dt.strftime('%Y-%m')
#create a new column for 'total gasoline sales'
grouped['total_gasoline_sales'] = grouped['diesel'] + grouped['unleaded']
#plot the info
plt.figure(figsize=(12, 6))
for store_id in store_ids_to_include:
    store_data = grouped[grouped['site_id_msba'] == store_id]
    plt.plot(store_data['calendar.calendar_day_date'], store_data['total_gasoline_sales'], label=f'Store {store_id} Gasoline Sales')
plt.xlabel('Date (Month)')
plt.ylabel('Gallons Sold')
plt.title('2021 Total Gallons of Gasoline Sold by Month')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Another challenge in modeling the data is that sales vary depending on the day of the week. Below you can see that inside sales build throughout the week, peak on Fridays, and then drop off on the weekend.
#total inside sales by weekday 2021
#convert 'calendar.calendar_day_date' to a date format
time_data['calendar.calendar_day_date'] = pd.to_datetime(time_data['calendar.calendar_day_date'])
#filter data for the year 2021
df_2021 = time_data[time_data['calendar.calendar_day_date'].dt.year == 2021]
#store IDs to include
store_ids_to_include = [21560, 22085, 22330, 22715]
#filter data for those store IDs
df_filtered = df_2021[df_2021['site_id_msba'].isin(store_ids_to_include)]
#group by 'site_id_msba' and day of the week, and calculate the sum of total inside sales
grouped = df_filtered.groupby(['site_id_msba', df_filtered['calendar.calendar_day_date'].dt.day_name()]) \
[['daily_yoy_ndt.total_inside_sales']].sum().reset_index()
#dictionary to order days of the week
day_order = {
'Monday': 0,
'Tuesday': 1,
'Wednesday': 2,
'Thursday': 3,
'Friday': 4,
'Saturday': 5,
'Sunday': 6
}
#day order for sorting
grouped['day_of_week_order'] = grouped['calendar.calendar_day_date'].map(day_order)
#sort the df by day of the week order
grouped.sort_values(by='day_of_week_order', inplace=True)
#create a new column for 'total inside sales'
grouped['total_inside_sales'] = grouped['daily_yoy_ndt.total_inside_sales']
#plot the info
plt.figure(figsize=(12, 6))
for store_id in store_ids_to_include:
    store_data = grouped[grouped['site_id_msba'] == store_id]
    plt.plot(store_data['calendar.calendar_day_date'], store_data['total_inside_sales'], label=f'Store {store_id} Inside Sales')
plt.xlabel('Day of the Week')
plt.ylabel('Sales')
plt.title('2021 Total Inside Sales by Day of the Week')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
The same trend applies to gasoline sales.
#gasoline sales by day of the week 2021
#convert 'calendar.calendar_day_date' to a date format
time_data['calendar.calendar_day_date'] = pd.to_datetime(time_data['calendar.calendar_day_date'])
#filter data for the year 2021
df_2021 = time_data[time_data['calendar.calendar_day_date'].dt.year == 2021]
#store ids
store_ids_to_include = [21560, 22085, 22330, 22715]
#filter data for those store ids
df_filtered = df_2021[df_2021['site_id_msba'].isin(store_ids_to_include)]
#group by 'site_id_msba' and day of the week, and calculate the sum
grouped = df_filtered.groupby(['site_id_msba', df_filtered['calendar.calendar_day_date'].dt.day_name()]) \
[['diesel', 'unleaded']].sum().reset_index()
#dict to order days of the week
day_order = {
'Monday': 0,
'Tuesday': 1,
'Wednesday': 2,
'Thursday': 3,
'Friday': 4,
'Saturday': 5,
'Sunday': 6
}
#day order for sorting
grouped['day_of_week_order'] = grouped['calendar.calendar_day_date'].map(day_order)
#sort the df by day of the week order
grouped.sort_values(by='day_of_week_order', inplace=True)
#create a new column for 'total gasoline sales'
grouped['total_gasoline_sales'] = grouped['diesel'] + grouped['unleaded']
#plot the info
plt.figure(figsize=(12, 6))
for store_id in store_ids_to_include:
    store_data = grouped[grouped['site_id_msba'] == store_id]
    plt.plot(store_data['calendar.calendar_day_date'], store_data['total_gasoline_sales'], label=f'Store {store_id} Gasoline Sales')
plt.xlabel('Day of the Week')
plt.ylabel('Gallons Sold')
plt.title('2021 Total Gallons of Gasoline Sold by Day of the Week')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
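An alternative to the manual `day_order` dictionary used in the two weekday plots above is pandas' ordered `Categorical` dtype, which makes `groupby` and sorting respect weekday order directly. A minimal sketch with made-up daily sales values:

```python
import pandas as pd

# Two weeks of sample daily dates with toy sales figures
dates = pd.Series(pd.date_range('2021-01-01', periods=14, freq='D'))

day_levels = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
              'Friday', 'Saturday', 'Sunday']
# An ordered Categorical sorts Monday..Sunday instead of alphabetically
weekday = pd.Categorical(dates.dt.day_name(), categories=day_levels, ordered=True)

df = pd.DataFrame({'weekday': weekday, 'sales': range(14)})
# Grouping on the Categorical yields results already in weekday order
by_day = df.groupby('weekday', observed=False)['sales'].sum()
```

This removes the need for the helper dictionary and the extra `day_of_week_order` sort column.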
In this notebook we have explored and cleaned the data provided by Maverik. Overall, the data was relatively clean but still required minor adjustments. For some variables we imputed N/A values to 'none' to express that those features are not available at those locations, rather than being genuinely missing data.
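The 'none' imputation described above is a one-liner with `fillna`; the amenity column names below are hypothetical stand-ins for the actual qualitative columns:

```python
import pandas as pd

# Hypothetical amenity columns where NaN means "feature not offered",
# not "value missing" -- so we impute the literal string 'none'
stores = pd.DataFrame({
    'site_id': [1, 2, 3],
    'car_wash': ['auto', None, 'self-serve'],
    'rv_lanes': [None, 'yes', None],
})
amenity_cols = ['car_wash', 'rv_lanes']
stores[amenity_cols] = stores[amenity_cols].fillna('none')
```

Keeping 'none' as an explicit category lets downstream encoding treat "not offered" as informative rather than dropping those rows.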
Through our analysis of variance and viewing the individual variables' distributions, we determined that there were seven variables we could remove from the data because they had no variance. We then looked at the holiday variable contained in the data. This variable tracked a variety of holidays, many of which were Catholic holidays. We chose to drop this column and replace it with a holiday column that tracks US federal holidays and their respective observed days. Viewing the relationships between in-store sales and gallons of gasoline sold over time provided some interesting information. August appears to be the busiest month for in-store sales, while July through October is the peak period for gasoline purchases. The most surprising finding is the steep drop-off in all sales on Saturdays and Sundays. We'll have to account for this seasonality by month and by weekday as we build out our time series model.
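A federal-holiday flag like the one described above can be built with pandas' bundled `USFederalHolidayCalendar`, which already shifts holidays to their observed dates. A sketch (the date column name is illustrative; our actual data uses `calendar.calendar_day_date`):

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Observed US federal holidays for 2021 (e.g. July 4 observed Monday July 5)
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start='2021-01-01', end='2021-12-31')

# Flag each calendar day as an observed federal holiday or not
days = pd.DataFrame({'date': pd.date_range('2021-01-01', '2021-12-31', freq='D')})
days['is_federal_holiday'] = days['date'].isin(holidays)
```

Because the calendar applies observance rules, weekend holidays are flagged on the weekday they are actually observed, which is what matters for store traffic.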
From the analysis of the qualitative data set, we can see that features such as F'real machines, diesel fuel, diesel exhaust fluid, and self-checkout are offered at most Maverik stores. This gives us valuable information we can use within our model to show how adding certain features to a new Maverik store could change its predicted sales for the 4 metrics: diesel, unleaded gasoline, in-store food service, and merchandise. We can use a composite score analysis to see which stores are, on average, more desirable to customers and then develop new stores that match those patterns. This information can also help us build a model that shows how adding or removing store features could impact total sales and units sold of in-store food service, merchandise, unleaded gasoline, and diesel fuel. Some of the Maverik qualitative store information contains possible outliers: certain stores are located in very densely populated areas or have a median income level much higher than average. We'll have to explore those variables further in the modeling process to see whether they are important in predicting sales.
After reviewing the data provided by Maverik, we chose to add two other data sets. These included oil and gas prices, as well as average monthly temperatures for states in which Maverik is operating. After all the data had been combined and transformed, we obtained a final data set for modeling that contained 100 columns with 13,542 observations. This data set is now ready to be used for model training and testing, with the goal of beating Maverik's current model in prediction accuracy.
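Combining the external data with the daily sales table amounts to a couple of pandas merges: oil prices join on the day, and average monthly temperatures join on state and month. A simplified sketch with made-up values and column names (the real oil-price and temperature files have more columns):

```python
import pandas as pd

# Simplified stand-in for the daily sales table
sales = pd.DataFrame({
    'date': pd.to_datetime(['2021-07-01', '2021-07-02', '2021-08-01']),
    'state': ['UT', 'UT', 'ID'],
    'diesel': [1200.0, 1350.0, 900.0],
})
# Daily oil price (illustrative numbers)
oil = pd.DataFrame({
    'date': pd.to_datetime(['2021-07-01', '2021-07-02', '2021-08-01']),
    'oil_price': [73.5, 74.1, 71.2],
})
# Average monthly temperature by state (illustrative numbers)
temps = pd.DataFrame({
    'state': ['UT', 'UT', 'ID'],
    'month': pd.PeriodIndex(['2021-07', '2021-08', '2021-08'], freq='M'),
    'avg_temp_f': [93.0, 91.0, 88.0],
})

# Oil joins on the exact day; temperature joins on state + month
merged = sales.merge(oil, on='date', how='left')
merged['month'] = merged['date'].dt.to_period('M')
merged = merged.merge(temps, on=['state', 'month'], how='left')
```

Left joins keep every sales row even where an external source has a gap, so the row count of the modeling set stays tied to the sales data.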
As stated above, each member of the group took responsibility to lead the analysis of a given question in the EDA. This is a synopsis of some of the tasks group members took on.
Rachel Butterfield: Rachel conducted the analysis of the time series data to identify trends in total sales for selected stores across the 4 main product categories: diesel fuel, unleaded fuel gallons, in-store services, and merchandise. She also contributed heavily to reviewing and editing the results summary and the notebook's content as a whole.
Heber Jenson: Heber led the analysis of the macro and seasonality factors that could be impacting the total sales found in the time series data, doing a deep dive into the factors that could be affecting a store's sales at a macro and seasonal level. He also analyzed the qualitative data set, helped review and compile the document, and brought in the external temperature and oil data sources used throughout the EDA.
Justin Hamilton: Justin conducted his analysis focusing on the qualitative data set, examining how each store could potentially perform based on what it has to offer. He conducted a survey to compare Maverik stores against one another via a composite score of how attractive each location would be to customers given their preferences. Justin also helped compile the document.